TEST “RELIABILITY”: ITS MEANING AND DETERMINATION


LEE J. CRONBACH 
UNIVERSITY OF CHICAGO 


The concept of test reliability is examined in terms of general,
group, and specific factors among the items, and the stability of 
scores in these factors from trial to trial. Four essentially different 
definitions of reliability are distinguished, which may be called the 
hypothetical self-correlation, the coefficient of equivalence, the co- 
efficient of stability, and the coefficient of stability and equivalence. 
The possibility of estimating each of these coefficients is discussed. 
The coefficients are not interchangeable and have different values in 
corrections for attenuation, standard errors of measurement, and
other practical applications. 


The literature of testing contains many discussions of test
reliability. Each year, new formulations are offered, and new
procedures for estimating reliability are championed. There appears to
have developed no universally accepted procedure, and several writ- 
ers have attributed this difficulty to the diversity of definitions for 
reliability now in use. It has often been suggested that perhaps the 
only effective way to resolve the conflicts among contending view- 
points is to replace the term “reliability,” recognizing that it covers 
not one, but several concepts. The present paper attempts to restate 
the conflicting concepts and assumptions now current, and to offer a 
scheme for separating the various aspects of dependability of meas- 
urement. 

The physical scientist generally has expressed the accuracy of
his observations in terms of the variation of repeated observations 
of the same event. The mean of the squared deviations of these ob- 
servations about the obtained mean is the “error variance.” This is 
a measure of precision or reliability. If for the present we regard 
reliability as the consistency of repeated measurements of the same 
event by the same process, two fundamental differences between the
problem of the physical scientist and the psychologist appear. The 
physical scientist makes two assumptions, both of which are ade- 
quately true for him. First, he assumes that the entity being meas- 
ured does not change during the measurement process. By control- 
ling the relevant conditions—and he usually knows what these condi- 
tions are and can control them—he can hold nearly constant the 
length of a rod or the pressure of a gas. When measuring a variable 
quantity, where his assumption is no longer valid, he abandons the
method of successive observations and employs instead simultaneous 
observations. The psychologist cannot obtain simultaneous measure- 
ments of behavior, yet the quantities that interest him are always 
variable and the method of successive measurements requires an im- 
possible assumption. The psychologist may wish to measure a hypo- 
thetical constant (aptitude, or a limen), but all he can ever observe 
is behavior, which is always shifting. It is one thing to test the ac- 
curacy of measurement of a quantity, quite another to test whether 
that quantity is constant. Judgment on the second question must 
await judgment on the first. 

The second assumption of the physical scientist is that his meas- 
urements are independent. If one rules out his remembering prior 
measurements, this assumption can usually be made true. Successive 
measurements of psychological quantities are rarely independent, 
however, because the act of measurement may change the quantity. 
London (10) has recently described this difficulty by the physicist’s 
term hysteresis. 

The reliability of a test score has generally been defined in terms 
of the variation of scores obtained by the individual on successive 
independent testings. Neither the assumption of constancy of true 
scores nor the assumption of experimental independence is realized 
in practice with most psychological variables; therefore, the reliabil- 
ity of a test, as so defined, is a concept which cannot be directly ob- 
served. If there is no standard of truth, it is fruitless to compare one 
estimate with another and debate which is more correct. But by vari- 
ous assumptions which usually cannot be tested, we obtain usable sta- 
tistics which describe the test. Different assumptions lead to differ- 
ent types of coefficients, which are not estimates of each other. In 
particular, as many writers have noted, an estimate of the stability 
of a test score is not at all the same as an estimate of the accuracy 
of measurement of behavior at any one instant. Jenkins cites Fran- 
zen’s comments on certain physiological measures which have high 
split-half “reliabilities” and low retest “reliabilities” (6). The meas- 
uring technique may be extremely accurate in reporting a biological 
instant in the life of an individual but not measure a stable character- 
istic of the individual. 

Both the physicist and the psychologist encounter the problem 
of observer error. In sighting through a telescope or scoring an es- 
say test, there is likely to be appreciable constant and variable error 
in observing. If one compares several judgments by the same ob- 
server, he includes the variable errors of observation with the errors 
of measurement. Hence, he studies the reliability of “this measuring 
instrument used by this man.” If scores obtained by several observers
in simultaneous measurements are pooled for comparison, the
constant error of each man is included as a source of variation. This
procedure studies the reliability of “this measuring instrument used 
by different men.” Since the human takes part in the measurement, 
one cannot study the reliability of an instrument apart from the men 
who use it. 


Types of “Reliability” 
It is known that

$$r_{tt} = 1 - \frac{\sigma_e^2}{\sigma_t^2}, \tag{1}$$

where $r_{tt}$ is the reliability coefficient, $\sigma_e^2$ is the
hypothetical error variance (the mean of the squared deviations of
all obtained scores for each person from the mean obtained score for
that person), and $\sigma_t^2$ is the variance of the scores of all
persons on all the hypothetical independent trials.
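
As an illustration of equation (1), the following minimal sketch computes the coefficient from a table of repeated scores. It assumes a hypothetical, balanced data layout (every person has the same number of trials) and takes the trials to be independent, which the text goes on to question for psychological data. The function name and the sample numbers are illustrative only and do not come from the paper.

```python
def reliability_from_trials(scores):
    """Equation (1): 1 - (error variance / total variance).

    scores[p][k] is the obtained score of person p on trial k.
    Hypothetical layout: every person must have the same number of trials.
    """
    all_scores = [x for person in scores for x in person]
    n = len(all_scores)
    grand_mean = sum(all_scores) / n

    # Total variance: all scores of all persons about the grand mean.
    total_var = sum((x - grand_mean) ** 2 for x in all_scores) / n

    # Error variance: each score about the mean score of its own person.
    error_sq = 0.0
    for person in scores:
        person_mean = sum(person) / len(person)
        error_sq += sum((x - person_mean) ** 2 for x in person)
    error_var = error_sq / n

    return 1 - error_var / total_var


# Three persons, two trials each (made-up numbers).
example = [[10, 12], [20, 18], [30, 30]]
r_tt = reliability_from_trials(example)
```

With these made-up scores the error sum of squares is 4 and the total sum of squares is 368, so the coefficient is 1 - 4/368.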

It is convenient to consider the possible definitions of error of 
measurement in terms of variance. Using a bi-factor pattern to de- 
scribe a test,* the variance of scores from a single testing may be 
expressed as follows: 


$$\sigma_t^2 = \sigma_g^2 + \sigma_{f_1}^2 + \sigma_{f_2}^2 + \cdots + \sigma_{s_1}^2 + \sigma_{s_2}^2 + \cdots + \sigma_{s_n}^2 + \sigma_e^2. \tag{2}$$


The terms have the following meanings:


$\sigma_t^2$ is the variance of obtained scores;

$\sigma_g^2$ is the variance in the general factor (if any) represented in
the test items;

$\sigma_{f_1}^2$, $\sigma_{f_2}^2$, etc., are the respective variances in the
orthogonal group factors of undetermined number, each of which is
represented in two or more items;

$\sigma_{s_1}^2$, $\sigma_{s_2}^2$, etc., are the respective “specificities” of
the n items, that is, the part of the reliable variance of scores on the
items which cannot be assigned to common factors; and

$\sigma_e^2$ is the residual variance.


The referents for these factors may be illustrated in a hypothetical
examination in psychology. The general factor might include
general knowledge of psychology, reading ability, motivation, and
other characteristics. Group factors might be related to knowledge
of separate topics, mathematical skill required in only a few items,
and so on. Each item taps, in addition, some specific knowledge not
demanded by other items. The specificity variance accounts for
individual differences in these elements. The remaining variance may
include momentary inattention, guessing, and other random elements.
For reference, the formula will be rewritten thus:


$$\sigma_t^2 = \sigma_g^2 + \sum \sigma_f^2 + \sum \sigma_s^2 + \sigma_e^2. \tag{3}$$


* Another factor pattern could be assumed without changing the basic
argument (4, 7-9, 107).


Consider now the scores obtained from a series of independent 
measurements of the same individuals using the same test. 


$$\sigma_t^2 = \sigma_{\bar{g}}^2 + \sum \sigma_{\bar{f}}^2 + \sum \sigma_{\bar{s}}^2 + \sum_i \sigma_g^2 + \sum_i \sum \sigma_f^2 + \sum_i \sum \sigma_s^2 + \sigma_e^2. \tag{4}$$

$\sigma_t^2$ is the variance of all obtained scores about the grand mean;

$\sigma_{\bar{g}}^2$ is the variance of the mean general factor scores of all
individuals about the mean for all individuals, that is, the
between-persons variance in g;

$\sigma_{\bar{f}}^2$ is the between-persons variance in a group factor;

$\sigma_{\bar{s}}^2$ is the between-persons variance in specificity on any item;

$\sum_i \sigma_g^2$ is the sum over individuals of the variances of the
general-factor scores for each individual about the mean for that
individual, that is, the within-persons variance;

$\sum_i \sum \sigma_f^2$ and $\sum_i \sum \sigma_s^2$ represent the corresponding
within-persons variances in the group factors and specificities,
respectively; and

$\sigma_e^2$ is the residual variance.

The between-persons variances represent, as in the case of the 
single trial, individual differences in the factors. The within-persons 
variances represent instability of scores for each individual, as a re- 
sult of changes from test to test. 

These formulations permit an exact statement of what a “reli- 
ability coefficient” represents. Apparently at least four fundamen- 
tally different meanings of reliability are current: 


(1) The “error variance” may be permitted to include, in equation
(4), the within-persons terms for the general factor, the group
factors, and the specificities ($\sum_i \sigma_g^2$, $\sum_i \sum \sigma_f^2$,
$\sum_i \sum \sigma_s^2$) and the residual variance $\sigma_e^2$. That is,
instability is regarded as an error of measurement. This is the coefficient
defined by the correlation from repeated independent administrations
of the same test. The assumption of constancy is made, since any
change of score from trial to trial is treated as an error of measure- 
ment. If that assumption is true, the instability terms vanish, but 
such constancy in all the behaviors a test measures is highly unlikely. 


(2) The “error variance” may be permitted to include, in equation
(4), the between-persons specificity term $\sum \sigma_{\bar{s}}^2$, the
within-persons terms $\sum_i \sigma_g^2$, $\sum_i \sum \sigma_f^2$, and
$\sum_i \sum \sigma_s^2$, and the residual variance $\sigma_e^2$. Both
instability and specificity are treated as errors. This is the
“reliability” defined by the correlation between successive independent
administrations of equivalent tests. Because different items are used in
preparing equivalent forms, the specific-factor scores of individuals
on the two tests will be uncorrelated. These, therefore, contribute to
changes in score and are treated as error. If the tests do not
represent the same group factors, at least part of the between-persons
group-factor variance $\sum \sigma_{\bar{f}}^2$ is also added to the
error variance.


(3) The “error variance” may be permitted to include, in equation
(3), the terms $\sum \sigma_s^2$ and $\sigma_e^2$. This defines “reliability”
as the correlation between two equivalent tests administered simultaneously.
Instability is excluded from consideration, and no assumptions of con- 
stancy are made. Specific-factor variances are included in errors of 
measurement. Depending on the degree of equivalence, part of the 
group-factor variance may also be treated as error. 


(4) The “error variance” may be restricted, in equation (3),
to the term $\sigma_e^2$. This is “reliability” defined as the self-correlation of
a test (see below). No assumption of constancy is made, and inde- 
pendence is not involved. The specific factors remain the same from 
test to test and are added to the true-score variance. All real vari- 
ables measured by the test are treated as quantities estimated, not as 
errors. 


It may now be helpful to restate these definitions and to give 
them names for reference. 


Definition (1): Reliability is the degree to which the test 
score indicates unchanging* individual differences in any 
traits. (Coefficient of stability).


Definition (2): Reliability is the degree to which the test 
score indicates unchanging individual differences in the gen- 
eral and group factors defined by the test. (Coefficient of 
stability and equivalence). 


* This may be modified by requiring constancy over some specified period 
(one year, one day, etc.) 
Definition (3): Reliability is the degree to which the test
score indicates the status of the individual at the present in- 
stant in the general and group factors defined by the test. 
(Coefficient of equivalence). Internal consistency tests are 
generally measures of equivalence. These coefficients pre- 
dict the correlation of the test with a hypothetical equiva- 
lent test, as like the first test as the parts of the first test 
are like each other. 


Definition (4): Reliability is the degree to which the test 
score indicates individual differences in any traits at the 
present moment. (Hypothetical self-correlation). 


These names are open to criticism, and better suggestions are in or- 
der. The important thing is to recognize that in the past all four of 
these and many approximations to them have been called “the reli- 
ability coefficient.” No one of these is the “right” coefficient. They 
measure different things, and each is useful. What is important is 
to avoid confusing one with another, and using one as an estimate of 
another. It may be noted that reliability of a test can only be dis- 
cussed in relation to a particular sample of persons. 

The components of error variance under each definition imply 
that in practice some coefficients will be larger than others for a giv- 
en test. If stability is not perfect, and if items contain some spe- 
cificity loading, the hypothetical self-correlation will be greatest, and 
the coefficient of stability and equivalence will be the smallest of the 
four. 
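
The ordering just described can be checked numerically. The sketch below is a simplification and not the paper's own computation: it lumps the variance of equations (3) and (4) into four hypothetical components (common-factor, specific, instability, residual), uses one total variance for all four coefficients, and takes each coefficient as one minus the share of variance that the corresponding definition counts as error. The function name and the component values are invented for illustration.

```python
def reliability_coefficients(common, specific, instability, residual):
    """Return the four coefficients for one set of variance components.

    Assumed grouping (a simplification of equations (3) and (4)):
      common      - stable general and group factor variance
      specific    - item specificity
      instability - within-person change from trial to trial
      residual    - momentary, random variance
    """
    total = common + specific + instability + residual
    error_terms = {
        "hypothetical self-correlation": residual,
        "coefficient of equivalence": specific + residual,
        "coefficient of stability": instability + residual,
        "coefficient of stability and equivalence":
            specific + instability + residual,
    }
    return {name: 1 - error / total for name, error in error_terms.items()}


# Made-up components that sum to 100.
coeffs = reliability_coefficients(
    common=60, specific=10, instability=20, residual=10
)
for name, value in coeffs.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the hypothetical self-correlation is largest and the coefficient of stability and equivalence is smallest whenever the specific and instability components are both positive; the relative order of the other two depends on which of those components is larger.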

As Kelley states (7), the concept of reliability is meaningless un- 
less one postulates that two measures of the same function exist. 
They may be successive measurements of a stable event, or simul- 
taneous measurements of a unique event. But in regard to the non- 
repeating event which can be observed only once, reliability has only 
a theoretical interest. In fact, if one accepts a deterministic position, 
there is no “error” in a measurement of a unique event. The stu- 
dent’s responses and his score are determined by many forces, and 
we do not know what they are; but the resultant of these forces is a 
particular act, and the act itself, at this instant and with these par- 
ticular forces, is perfectly reliable. “Chance” and “error” are merely 
names we give to our ignorance of what determines an event. 

All methods of studying reliability make a somewhat fallacious 
division of variables into “real variables” and “error.” It is probably
more correct to conceive a continuum between the instantaneous
behavior which has an infinitesimal period, through states of longer 
duration, to the virtually constant individual differences. A test score 
is made up of all these “real” elements, each of which could be per-
fectly predicted if our knowledge were adequate. Reliability, accord- 
ing to this conception, becomes a measure of our ignorance of the 
real factors underlying brief fluctuations of behavior and atypical 
acts. Perhaps a new statistical method based on the non-Aristotelian 
conception of a continuum of realities will some day permit us to 
avoid the troublesome attempt to divide the continuum into “reality” 
and “error.” 

For the present, it appears to be necessary to retain the artificial 
separation. In thinking about the self-correlation of a test—the con- 
sistency with which it measures whatever it measures—we may class 
as chance effects all variables whose period of variation is shorter 
than the time required to take the test. Momentary fluctuations are 
therefore “errors,” but shifts in fatigue, set, or skill having a longer 
cycle are possibly worth measuring. 


Techniques of Estimation 


Each method used in the past to study “reliability” may be associated
with one of these definitions. The procedures requiring more
than one trial will be discussed first. 


Retest method. The retest method calls for giving the same test 
twice to the same group. The trials are supposed to be independent, 
but this may well not be true. Shift in relative scores is always treat- 
ed in the error variance, not the true-score variance; the retest co- 
efficient is therefore an estimate of the coefficient of stability. Fail- 
ure to attain independent trials may make the estimate too high or 
too low. 

Guttman (3, 263), in a complete reconsideration of reliability 
theory, defines reliability in terms of the stability of individual dif- 
ferences during a large number of “independent” retests. He shows 
that the reliability thus defined (a coefficient of stability) may be 
estimated by the correlation between two independent trials. His def- 
inition of independence will be discussed below. 


Equivalent tests method. Two “equivalent” or “parallel” tests 
may be given, with any interval between, and their correlation de- 
termined. Experimental independence is assumed, despite the effect 
experience with one form may have on the second. Constancy is as- 
sumed, and all shifts in relative score are treated in the error vari- 
ance. Specific-factor variances are treated in the error variance. This 
is therefore an estimate of the coefficient of stability and equivalence. 
Because the assumption of independence cannot be tested, it is never 
known whether the estimate is high or low. To interpret a coefficient
involving equivalence, one must know how the tests are equivalent. 
If the tests are alike only in the general factor, group-factor vari- 
ances are included as error, and the coefficient reflects the extent to 
which scores are determined by a stable general factor. Parallel tests 
should ordinarily have the same general and group factors. Were 
items in the two forms matched to test the same specific items of in- 
formation or skill, the equivalent tests might to some degree include 
the same specific factors. The specific factors in the two tests could 
not be completely the same, however, unless the items were identical. 
The coefficient of equivalence is a property of a pair of tests and will 
vary according to the kind of similarity established in equating the 
tests. To the degree that parallel tests have the same general and 
group factors, the coefficient indicates the stability of scores
in the general and group factors. 


The split-half method. The widely used split-half method requires 
the correlation of half the items in the test with the remaining items. 
Cronbach has studied the effect of various splits upon the resulting 
coefficient (1) and has suggested the use of parallel splits, in which 
the two halves are made nearly equivalent (2). In the parallel split, 
each part represents the general factor and the group factors of the 
original test as well as possible. The half-tests should have equal 
standard deviations. The procedure makes no assumption of con- 
stancy, but does include the specific-factor variance as error variance. 
The split-half estimate is a coefficient of equivalence, estimating the 
correlation of simultaneously administered parallel tests, as like each 
other as are the halves of the test given. Any failure in splitting to 
obtain equivalent halves will tend to lower the correlation obtained. 
An assumption of experimental independence is made in considering 
the split-half correlation an estimate of the parallel-test correlation. 
In testing by parallel tests, the performance on one form is presumably
independent of performance on the other. When items are presented
together, however, there is always the possibility of spurious
inter-item correlation due to item linkages and brief fluctuations of 
mood and attention. 

Most random or odd-even splits do not represent all factors 
equally in both halves. If the assumption of experimental indepen- 
dence were valid, the correlation would therefore be an underestimate 
of the coefficient of equivalence. Guttman (3, 260) states that the 
corrected split-half coefficient is always a lower bound to “the reli- 
ability coefficient,” no matter how the test is split. He cautions that 
this inequality is true only for an indefinitely large sample of per- 
sons. Sampling errors in practice preclude taking as one’s coefficient 
the largest of many trial split coefficients. Guttman defines reliabil-
ity in terms of repeated independent trials of the same (not equiva- 
lent) tests. By this definition, the split-half estimate, including spe- 
cificity as an error of measurement, is a low one. The coefficient of 
equivalence is a conservative estimate of the hypothetical self-corre- 
lation. 

The assumptions of the Spearman-Brown formula have been 
stated in various ways, and this has led to some confusion as to the 
applicability of the formula. The derivation hypothecates equivalent 
tests and predicts their correlation from the correlation of equivalent 
half-tests. Equivalence is the only assumption made, and in the
derivation equivalence is defined by requiring equal standard
deviations of the half-tests and by requiring that the hypothetical
equivalent tests be just as similar as the half-tests (that is, that the
correlations among the half-tests, within and between tests, be equal).
This defines equivalence so that all tests have the same common factor 
composition. It makes no direct assumption of the equivalence of pairs 
of items or of the unit-rank among the item intercorrelations. 
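
The split-half procedure with the Spearman-Brown formula can be sketched as follows. This is a minimal illustration under stated assumptions: the item-score matrix, the odd-even assignment of items to halves, and all function names are hypothetical, and the step-up formula 2r/(1 + r) is the standard Spearman-Brown form for doubling test length, which presupposes half-tests of equal standard deviation.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Predicted correlation of two full-length equivalent tests."""
    return 2 * r_half / (1 + r_half)


def split_half_coefficient(item_scores, half_a, half_b):
    """item_scores[p][i] is the score of person p on item i;
    half_a and half_b list the item indices assigned to each half."""
    a = [sum(row[i] for i in half_a) for row in item_scores]
    b = [sum(row[i] for i in half_b) for row in item_scores]
    return spearman_brown(pearson(a, b))


# Five persons, four items (made-up scores), odd-even split.
item_scores = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [2, 2, 1, 2],
    [2, 1, 2, 2],
    [3, 3, 2, 2],
]
estimate = split_half_coefficient(item_scores, half_a=[0, 2], half_b=[1, 3])
```

In this made-up example the half-test scores correlate 0.8, and the stepped-up estimate is 8/9. A different assignment of items to halves would generally give a different value, which is why the kind of split matters.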

The items of a test may be considered as a sample of some larger 
population. One may define the purpose of the test in terms of the 
population of items to be measured; the test fulfils this purpose inso- 
far as the items are a representative sample of the population. Alter- 
natively, one may consider the test as defined by its items, and think 
of the population as the entire group of items of which the sample 
is representative. The coefficient of equivalence (obtained by the 
parallel-test or internal consistency methods) correlates two samples 
of items and indicates the extent to which the variance in each may 
be attributed to common factors. The extent of common-factor loadings
is the extent to which test scores are determined by “the population
variable.” If the samples to be compared must be representative,
rather than random, it is necessary, in split-half procedures, to use 
the parallel split or a split according to a table of specifications. 


The Kuder-Richardson formulas. A radical reformulation of the 
reliability problem was offered in 1937 by Kuder and Richardson (8). 
They proposed several alternative formulas which have been widely 
adopted. The original derivation has been criticized because of the 
numerous assumptions made, but other writers have developed the 
same formulas more directly. Perhaps the simplest derivation was 
published by Jackson and Ferguson (5, 74). They define reliability 
as a coefficient of equivalence, equivalence being defined by requiring 
that the two tests have equal variances and that the mean inter-item 
covariance within each test be equal and equal to the mean inter-item 
covariance between tests. If these assumptions are satisfied, the 
Kuder-Richardson formula (20) is an exact estimate of the coefficient 
of equivalence. This condition is a reasonable one when the items of
a test are considered as drawn from a population of items all
measuring a single general factor. If group factors are present, even
though the two tests measure these group factors equally, then
$\overline{r_{ij} s_i s_j} < \overline{r_{ij'} s_i s_{j'}}$,* and the
Kuder-Richardson formula gives a conservative estimate of the
coefficient of equivalence; how conservative one does not know.
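
For concreteness, Kuder-Richardson formula (20) can be computed as in the sketch below. The paper does not print the formula; the form used here, n/(n - 1) times (1 - sum of p_i q_i / s_t^2) for items scored 0 or 1, is the usual statement of it. The function name, the use of population (divide-by-N) variances, and the response matrix are assumptions made for the example.

```python
def kr20(responses):
    """Kuder-Richardson formula (20) for dichotomous (0/1) items.

    responses[p][i] is 1 if person p passed item i, else 0.
    Population variances (dividing by N) are assumed throughout.
    """
    n_persons = len(responses)
    n_items = len(responses[0])

    # Proportion passing each item, and the sum of item variances p*q.
    p = [sum(row[i] for row in responses) / n_persons for i in range(n_items)]
    sum_pq = sum(pi * (1 - pi) for pi in p)

    # Variance of total scores.
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_persons
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_persons

    return n_items / (n_items - 1) * (1 - sum_pq / var_total)


# Four persons, three items (made-up responses).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
kr20_value = kr20(responses)
```

Here the item variances sum to 0.625 and the total-score variance is 1.25, so the coefficient is 1.5 x (1 - 0.5) = 0.75.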


The Guttman lower bounds. The latest statement of the problem 
is that published by Guttman in 1945 (3). He derives six formulas 
for estimating a coefficient from data obtained on a single testing, 
all the estimates being lower than the “true reliability” if the sample 
is sufficiently great. His estimate $L_3$ is identical to that from
Kuder-Richardson formula (20), although the derivations are dissimilar.
His $L_4$ is equivalent to the split-half coefficient. $L_2$, which uses item
covariances, is an original formula more difficult to compute than $L_3$
and $L_4$. $L_1$, $L_5$, and $L_6$ are expected to have little practical importance.
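
The first three lower bounds can be computed from the item covariance matrix, as in the sketch below. The formulas are not printed in this paper; the forms used are the ones usually attributed to Guttman's 1945 article and should be read as an assumption here: L1 = 1 - (sum of item variances)/s_t^2, L3 = n/(n - 1) x L1, and L2 = L1 + sqrt(n/(n - 1) x C2)/s_t^2, where C2 is the sum of squared covariances over all ordered pairs of distinct items. The function name and the data are hypothetical.

```python
from math import sqrt


def guttman_bounds(item_scores):
    """Return (L1, L2, L3) from item_scores[p][j], person p on item j.

    Population (divide-by-N) variances and covariances are assumed.
    """
    n_persons = len(item_scores)
    n_items = len(item_scores[0])
    means = [
        sum(row[j] for row in item_scores) / n_persons for j in range(n_items)
    ]

    def cov(j, k):
        return sum(
            (row[j] - means[j]) * (row[k] - means[k]) for row in item_scores
        ) / n_persons

    item_var_sum = sum(cov(j, j) for j in range(n_items))
    # Variance of total scores = sum of every entry of the covariance matrix.
    total_var = sum(cov(j, k) for j in range(n_items) for k in range(n_items))
    # Sum of squared covariances over ordered pairs of distinct items.
    c2 = sum(
        cov(j, k) ** 2
        for j in range(n_items)
        for k in range(n_items)
        if j != k
    )

    l1 = 1 - item_var_sum / total_var
    l2 = l1 + sqrt(n_items / (n_items - 1) * c2) / total_var
    l3 = n_items / (n_items - 1) * l1
    return l1, l2, l3


# Four persons, three dichotomous items (made-up responses).
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
l1, l2, l3 = guttman_bounds(data)
```

For these made-up responses L1 is 0.5 and L3 is 0.75, the value Kuder-Richardson formula (20) gives for the same data, while L2 is slightly higher.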


Guttman defines error as the variation of the score of a person 
over a universe of independent trials with the same test. His crucial 
assumption, C, (3, 265-266), defines independence so that the score 
of a person on any item on any trial is experimentally independent 
of his scores on any other items. In practice, changes in motivation, 
function shift, and other variables cause items administered together 
to vary together. Guttman classes shifts in the variables measured 
as errors of measurement and therefore is estimating a coefficient of 
stability when he demonstrates that the correlation between two in- 
dependent trials on a large population may be taken as equal to “the 
reliability coefficient” (3, 268). 

In deriving lower-bounds formulas, Guttman deals with hypo- 
thetical independent retests in which the mean covariance of two 
items within trials equals the mean covariance of the same items be- 
tween trials. Beyond this he makes no assumption. His definition of 
independence requires that there be no shift in the variables meas- 
ured between trials; i.e., that the hypothetical trials be simultaneous. 
Since he is using identical tests simultaneously, he has defined reli- 
ability as the hypothetical self-correlation. His formulas lead to un- 
derestimates of that coefficient. 

One may study the effect on Guttman’s results if his assumption 
of independence within trials is denied. This may occur when one 
item influences the answer to another by giving a clue, by causing 
encouragement or discouragement, or by setting up a pattern among 


* i.e., the mean inter-item covariance within tests is less than the mean
inter-item covariance between tests.
the responses. In the derivation of L,, the assumption leads to dis-
carding a positive covariance term from the right member of (28). 
As a consequence, 4; and L, are greater than they would be without 
the assumption, and may overestimate the hypothetical self-correla- 
tion as defined. In the derivation of L., L;, and L,, the assumption 
is felt in (25), where a positive covariance term is dropped from the 
right member. Without the assumption, 


rc y 
Yer 7; a 7X xX, » Gg a J> 


and the inequality given in (37) may not hold. The remainder of the 
derivation therefore may lead to estimates higher than the hypotheti- 
cal self-correlation, if the assumption of experimental independence 
of items does not hold. 

This weakness is common to all estimates of reliability based on 
a single trial. Lindquist (9, 219) points out that in the split-half 
method the two halves are falsely assumed to be experimentally inde- 
pendent, and therefore he considers the split-half estimate spuriously 
high. [He, however, defines reliability as what we have called the 
coefficient of stability and equivalence (9, 216)]. In the Kuder-Rich- 
ardson formula, as derived by Jackson and Ferguson, the same as- 
sumption of independence is made when the mean inter-item covari- 
ance between tests is taken as equal to the mean covariance within 
tests. If motivation, response sets, and other factors common to per- 
formance on the various items of a trial are considered part of the 
general or group factors measured by the test, their contribution to 
the inter-item correlation within a trial is rightly included in the 
estimate of accuracy of measurement. But momentary variations 
which cause random changes in item covariance should not be permitted
to raise the estimate obtained. Any estimate of self-correlation
or equivalence based on a single trial may be higher than the
hypothetical self-correlation. It may be treated as a conservative or 
exact estimate only if we are willing to assume that the response to 
each item is an independent behavior, related to response on other 
items only because of significant conditions in the person tested. 

Guttman makes the point that his split-half formula 


$$L_4 = 2\left(1 - \frac{s_a^2 + s_b^2}{s_t^2}\right) \tag{5}$$


is superior to the Spearman-Brown formula in that it does not assume 
the two half-tests to have equal variance. His formula can be derived 
as an estimate of the coefficient of equivalence, according to the usual 
proof of the Spearman-Brown formula, except that equivalence is defined
so that the two tests have equal standard deviations and the covariances
among the half-tests, within and between tests, are all equal.
This leads to a formula identical to Guttman’s, or an equivalent
form previously derived by Flanagan (see Kelley, 7) which is
less readily computed. Values obtained using this formula are smaller
(usually by a small amount) than the values from the Spearman-Brown
formula, except where $s_a = s_b$. It appears that this formula
should replace the Spearman-Brown procedure. 
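
The relation between formula (5) and the Spearman-Brown estimate can be verified on small examples. The sketch below is illustrative only: the function names and half-test scores are made up, population (divide-by-N) variances are assumed, and s_t^2 is taken as the variance of the sum of the two half-test scores.

```python
from math import sqrt


def variance(x):
    """Population variance (divide by N)."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)


def guttman_l4(a, b):
    """Formula (5): 2 * (1 - (s_a^2 + s_b^2) / s_t^2)."""
    total = [u + v for u, v in zip(a, b)]
    return 2 * (1 - (variance(a) + variance(b)) / variance(total))


def spearman_brown_from_halves(a, b):
    """Correlate the two halves, then step up with 2r / (1 + r)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b)) / n
    r = cov / sqrt(variance(a) * variance(b))
    return 2 * r / (1 + r)


# Halves with equal standard deviations: the two estimates agree.
a_equal, b_equal = [1, 2, 3, 4, 5], [2, 1, 4, 3, 5]
l4_equal = guttman_l4(a_equal, b_equal)
sb_equal = spearman_brown_from_halves(a_equal, b_equal)

# Halves with unequal standard deviations: formula (5) is smaller.
a_uneq, b_uneq = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]
l4_uneq = guttman_l4(a_uneq, b_uneq)
sb_uneq = spearman_brown_from_halves(a_uneq, b_uneq)
```

In the first pair both halves have variance 2 and the two estimates coincide at 8/9. In the second pair the halves correlate perfectly but have variances 2 and 8, so the Spearman-Brown value is 1 while formula (5) gives 8/9.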


Summary 

Four possible definitions of “reliability” have been considered.
The hypothetical self-correlation requires independent simultaneous 
identical tests. For psychological variables this is a hypothetical sit- 
uation, and no one has found an unbiased estimate of this coefficient. 
Guttman’s formula $L_2$ would be a conservative estimate of the
hypothetical self-correlation, save for the necessity of assuming that
responses to one item are not influenced by responses to another item.
Guttman’s $L_2$ is ordinarily greater than the estimate from the
Kuder-Richardson formula.

The coefficient of equivalence is lower than the hypothetical self- 
correlation. Kuder-Richardson formula (20) is an exact estimate of 
the coefficient of equivalence for tests where the item intercorrelation 
matrix has rank one; otherwise the estimate is conservative. This, 
however, like all estimates of equivalence, assumes experimental in- 
dependence of items within one trial. The parallel-split method gives 
an estimate of the coefficient of equivalence. For an ideally large
population, the highest split-coefficient is the best estimate, and esti- 
mates from other splits are conservative, save for the failure of in- 
dependence of items. 

The coefficient of stability is lower than the hypothetical self- 
correlation. It is estimated by the test-retest correlation, but carry- 
over from one test to another may cause the estimate to be faulty. 

The parallel-tests correlation is an estimate of the coefficient of
stability and equivalence. It may be unduly high if the two tests are 
not experimentally independent. Otherwise, the estimate will ordi- 
narily be lower than the coefficient of stability or the coefficient of 
equivalence. 

A simple table may indicate the different meanings of the vari- 
ous procedures. In Table 1, checks indicate the variances which are 
included in the error of measurement, according to each procedure. 
In the absence of sampling error, any estimate of reliability is less 
than the hypothetical self-correlation, assuming experimental inde- 
pendence. Every procedure assumes either the experimental indepen- 
dence of trials or of items within the trials. This condition is rarely 
satisfied, and any obtained coefficient may therefore be higher than
the coefficient supposed to be obtained. 


TABLE 1 
Variances Included in Error Variance of a Test, According to 
Various Formulations of the Reliability Problem* 


[Column headings, printed vertically in the original, are illegible in this copy; the column positions of the marks are not preserved.]

Test-Retest                     x x x x
Parallel Test                   x x x x x
Parallel Split                  x x
Random Split                    x x x
Kuder-Richardson (20)           x x x
Guttman L₂                      x†
Hypothetical Self-Correlation   x
Coefficient of Equivalence      x x
Coefficient of Stability        x x x x
Coefficient of Stability
  and Equivalence               x x x x x


* An x indicates that the variance indicated is included in the error of measurement by the
procedure or definition listed at the left.

† In equations (31) and (43), Guttman sets up inequalities which overestimate the item error 
variance. 


Practical Implications 

No one “best” estimate of reliability exists. If one could validly 
make the assumption of stability between trials, and independence of 
trials, the test-retest correlation would be satisfactory. Frequently 
we must rely on single-trial estimates. Guttman’s L₂ or a parallel-split
used with his L₄ will in general give the highest coefficients.
Where the test measures a single factor, the Kuder-Richardson formula
(Guttman’s L₃) should be as useful as the other two procedures. 

In many situations, it is appropriate to seek a coefficient other 
than the hypothetical self-correlation. In correcting for attenuation, 
any of the coefficients described in this paper may be appropriate. 
Following the lead of Remmers and Whisler (11), one may distin- 
guish between the “true instantaneous score” in a variable (related 
to the self-correlation or the coefficient of equivalence) and the “true 
score” in a trait (related to the coefficient of stability or of stability 
and equivalence). Sometimes one wishes to know the correlation be- 
tween true scores in two traits postulated as stable over a period of 
time—“somatotype” vs. “temperament” is a typical problem. Here 
the appropriate coefficients for use in the attenuation formula are the 


Items 
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coefficient of stability (if the trait is defined operationally by a spe- 
cific test) or the coefficient of stability and equivalence (if the trait 
is defined by a family of similar tests). Other problems call for study- 
ing the relation between true instantaneous score in one variable 
(such as an aptitude test) and true score in another defined as stable 
(such as job performance). For this, the reliability of the former 
score would be based on a coefficient of equivalence (since the hypothetical
self-correlation is not known), and the reliability of the latter
would be based on one of the coefficients involving stability. The 
third possibility, and one of much theoretical importance, is a prob- 
lem regarding true instantaneous scores in two variables, such as 
mood and performance. The correction for attenuation here requires 
use of two coefficients of equivalence. 
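
The choice among coefficients in the three cases above can be made concrete with the familiar correction for attenuation, in which the observed correlation is divided by the square root of the product of the two reliabilities. The Python sketch below is illustrative only: the function name and all numerical values are hypothetical, and the point is merely that the corrected value depends on which coefficient is entered for each variable.

```python
from math import sqrt


def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Observed correlation divided by sqrt(rel_x * rel_y).

    rel_x and rel_y are whichever coefficients suit the question asked
    (equivalence, stability, or stability and equivalence).
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / sqrt(rel_x * rel_y)


# Hypothetical case: aptitude test (instantaneous score) against job
# performance (stable trait), observed correlation .40.
observed = 0.40
equivalence_aptitude = 0.90   # coefficient of equivalence, hypothetical
stability_performance = 0.60  # coefficient of stability, hypothetical

print(round(correct_for_attenuation(observed, equivalence_aptitude,
                                    stability_performance), 3))
# Entering a different coefficient for the same variable changes the result.
print(round(correct_for_attenuation(observed, 0.75, stability_performance), 3))
```

Because the divisor changes with the coefficient chosen, two investigators correcting the same observed correlation can reach different "true" correlations without either having made an arithmetic error; the difference lies in the definition of true score each has adopted.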

Similar reasoning applies to the problem of estimating the sig- 
nificance of changes in test score. If the identical test is given both 
times, the coefficient of stability is appropriate. The hypothetical self- 
correlation, if known, would test whether a significant change in be- 
havior had occurred, although this change might be due to normal 
diurnal fluctuation. The coefficient of stability tests whether the 
change is greater than that “normally” to be expected due to function 
fluctuation. If growth is measured by equivalent tests, a coefficient 
of equivalence, or of stability and equivalence, is relevant. 

In evaluating a test, all four coefficients are of interest. For 
most purposes, one wishes to measure stable characteristics, so that 
a coefficient of stability is needed. For research purposes, however, 
a test having high instantaneous self-correlation or equivalence and 
low stability may be very satisfactory. 

The coefficient of stability is an abstraction; in reality, there is 
an indefinitely large number of such coefficients, corresponding to 
various time intervals between tests. For meaningful use of such a 
coefficient, it must be defined as “the coefficient of stability over one 
week,” or the like. The coefficient also depends on the conditions affecting
the subject between testings. Strictly speaking, a coefficient 
of stability may be carried over to a new situation only when the time 
interval and the conditions between testings are similar to those un- 
der which the coefficient was obtained. The coefficient of stability 
would be better understood if research were available showing how 
the coefficient varies with increasing time lapse. 

The following recommendations result from the analysis made 
above. 

1. Reliability for psychological measurement can never be ob- 
served as in the physical sciences, where variables are practically 
constant and non-hysteretic. All estimates of reliability require
assumptions unlikely to be fulfilled. 

2. Several coefficients numerically less than the hypothetical 
self-correlation can be estimated. A distinction between these vari- 
ous coefficients should be made; the writer proposes the names coef- 
ficient of equivalence, coefficient of stability, and coefficient of sta- 
bility and equivalence. 

3. The coefficient of equivalence may be estimated by the parallel-split
method, using formula (5), Guttman’s L₄. The Kuder-Richardson
formula (20) underestimates this coefficient unless the 
test item matrix has rank one. Guttman’s L₂ gives an underestimate 
of the hypothetical self-correlation which may or may not be higher 
than the coefficient of equivalence. All estimates of reliability or 
equivalence based on a single trial assume that test items are experi- 
mentally independent. To the extent that this is untrue, estimates 
may be erroneously high. 

4. The coefficient of stability may be estimated by the test-re- 
test method, with an undetermined error due to failure of indepen- 
dence. The coefficient of stability and equivalence may be estimated 
by the correlation of parallel tests, with a similar error. 

5. In describing a test, the author should provide separate es- 
timates of the coefficient of equivalence and the coefficient of stability. 
The time interval used in obtaining the coefficient of stability should 
be reported. If there are multiple forms, the coefficient of stability 
for each should be given. 

6. In practice, the coefficient of equivalence or the coefficient of 
stability may be used meaningfully where the reliability coefficient 
is called for. The coefficients are not interchangeable and have dif- 
ferent meanings in corrections for attenuation, standard errors of 
measurement, and like applications. The hypothetical self-correlation, 
showing the extent to which a test measures real but possibly mo- 
mentary differences in performance, is more important to the theory 
of measurement than to the practical use of tests. 


TABLE FOR DETERMINING PHI COEFFICIENTS 


C. E. JURGENSEN 
MINNEAPOLIS GAS LIGHT COMPANY 


A table is presented which directly gives phi coefficients accurate
to three places when entered by the proportion of one sub-group 
responding in a specified manner and the proportion of a second 
sub-group responding in the same manner. The table gives coefficients
identical with those obtained by formula if the sub-groups are 
equal in number. The phi coefficients can readily be expressed, if 
desired, in terms of critical ratio or chi square. The table is more 
accurate than the use of abacs and eliminates the use of time-con- 
suming formulas. Accurate determination of item validity on the 
basis of statistically rigorous techniques can be made more quickly 
by means of the table than validity determined by less efficient meth- 
ods which have previously been used to save time. 


Increasing emphasis on item analyses is being found in test con- 
struction and development, particularly with regard to item validity 
as estimated by means of internal consistency or an outside criterion. 
Long and Sandiford (5) have summarized the methods which were 
most popular a decade ago, and test construction since that time 
seems to have continued to follow those methods in the main. 

Methods for determining whether individual items should be re- 
tained or rejected (or the closely allied problem of determining what 
weights should be assigned each item) vary from a simple determina- 
tion of the difference in per cent of persons in contrasted groups who 
respond similarly to a given item, to the more rigorous correlational 
techniques or levels of significance. Because of the great amount of 
computational time necessary to determine item validities on the ba- 
sis of correlational techniques, the simpler and less efficient tech- 
niques have too often been used. This is particularly unfortunate in 
those cases where the number of cases is small and has often resulted 
in retaining items purely on the basis of chance differences. In an 
effort to reduce the time required for the more accurate types of item 
validation, various tables, nomographs, and abacs have been devel- 
oped. 
One of the earliest of these techniques was a table developed by 
Edgerton and Paterson (1) which gives standard errors from .001 
to .100 by successive steps of .001 for all possible combinations of 
differences between percentages. For a maximum percentage differ- 
ence of 50, this table covers a maximum standard error for 25 cases 
and a minimum standard error for one million cases. These data do 
not give item validity, but do permit necessary computations to be 
made more readily than otherwise. 

Votaw (8) has published the formulas necessary to construct an 
abac based on the probable error of differences between two groups 
as well as an abac based on a critical ratio of 2 (in terms of probable 
error) when the N of each subgroup equals 45. Mosier and Mc- 
Quitty (7) have published a more detailed abac giving critical ratios 
of 2, 3, 5, 7 and 10 (in terms of standard error) based on the highest 
and lowest fifty cases of a group having a total N of 200. Mosier and 
McQuitty also published correlational abacs based on upper-lower 
halves and upper-lower quarters. Guilford (4) has published an abac 
based on phi coefficients ranging from 0 to +.90 in steps of .10 and 
another giving 1% and 5% levels of significance when the total N 
equals 50, 100, 200, and 400. 

Lord (6) has given an alignment chart for calculating the four- 
fold point correlation coefficient, the chart being entered on three 
values: per cent of cases successful with respect to the first variable, 
the similar per cent for the second variable, and the per cent of cases 
successful with respect to both variables. 

Fiske and Dunlap (2) have published a formula for construct- 
ing an ellipse on the assumption that the two sub-groups are random 
samples from the same parent population and that the best estimate 
of the true proportion is the weighted mean proportion of the two 
samples. Fiske and Dunlap’s abac is based on a critical ratio of 2 
with one hundred persons in each sub-group. 

Numerous other abacs and nomographs have been constructed. 
Although they all possess the advantage of reducing statistical com- 
putations, other objections are inherent in their use. Abacs and no- 
mographs based on critical ratios, level of significance, or chi square 
are necessarily constructed for use in situations having a specified 
number of cases in each sub-group. As the N is changed, so must the 
abac be changed also. Inasmuch as construction of an abac requires 
computing numerous points by means of formula and careful draw- 
ing of a curved line to fit these points, and because the number of 
abacs which can be drawn is unlimited, this procedure becomes
impractical. 

Rather than constructing an abac for each possible N, it is also 
possible to devise abacs for various selected N’s, and in any single 
study the abac can be used which most closely approximates the avail- 
able data. Although such approximations are sufficiently accurate 
for most practical purposes, many research workers are reluctant to 
state in the literature that their work is based on approximations 
lest such statement be interpreted as an indication of flatulent work. 

Another possible procedure is to discard data to the extent that 
the N fits one of the available abacs. This procedure may be satis- 
factory when data are based on several hundred cases but cannot be 
recommended when the original N is small. Necessity frequently dic- 
tates a small N, and it is inadvisable to further reduce the N in or- 
der to permit quicker analysis of data by means of an available abac. 

Another difficulty inherent in abacs is the difficulty of accurately 
determining exact item validity when interpolations must be made 
between the ellipses of an abac. The error of such estimates may be 
further increased by the necessity for interpolation at the points of 
entry on the abac. 

Some of the objections to nomographs and abacs can be over- 
come by using a table which is entered in a similar manner; namely, 
the proportion of one sub-group responding in a specified way sup- 
plies the vertical entry and the proportion of another sub-group re- 
sponding in the same way is entered in the horizontal dimension. 
Exact item validity can then be read at the point of intersection of 
the column and row. 

In order to be of maximum value, a table of this type should be 
expressed in terms which permit its use with any desired number of 
cases and should be such as to permit rapid determination of the degree 
of significance of any difference. The table of phi coefficients accompanying
this article fulfills these conditions. The following interrelationships
obtain when the number of cases in the two sub-groups is equal:

$$\text{Phi Coefficient: } \phi = \sqrt{\frac{\chi^2}{N_{tot}}} = \frac{CR}{\sqrt{N_{tot}}} \qquad (1)$$

$$\text{Critical Ratio: } CR = \phi \sqrt{N_{tot}} = \sqrt{\chi^2} \qquad (2)$$

$$\text{Chi Square: } \chi^2 = N_{tot}\,\phi^2 = CR^2 \qquad (3)$$


A Pearson r corresponding to the phi coefficient can be estimated 
if desired. If both variables can be considered as being continuous, 
the Pearson r is estimated by dividing φ by .637. If one set of data 
are genuinely dichotomous and the other is continuous but artificially 
reduced to a dichotomy, the Pearson r can be estimated by dividing φ 
by .798, although, as Guilford (3) points out, the meaning of such 
figures is questionable and interpretation should be made only with 
extreme caution and with cognizance of the steps by which the coefficient
was derived. If true point distributions are involved, φ is numerically
equivalent to the Pearson r. 

In addition to being readily converted to CR and χ², the phi coefficient
has the added advantages of being widely applicable in many 
situations, and being one of the few coefficients which can properly 
be used in some of these situations. 
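
The conversions among φ, the critical ratio, chi square, and an estimated Pearson r can be sketched in a few lines of Python. The function names and the sample values (φ = .30, total N = 100) are hypothetical; the divisors .637 and .798 are those given in the text, and the caution attached to them there applies equally here.

```python
from math import sqrt


def phi_to_cr(phi, n_total):
    """Critical ratio from phi, for two equal sub-groups totalling n_total."""
    return phi * sqrt(n_total)


def phi_to_chi_square(phi, n_total):
    """Chi square from phi: N * phi squared."""
    return n_total * phi ** 2


def pearson_from_phi(phi, kind):
    """Estimated Pearson r, using the divisors quoted in the text.

    kind: 'both_continuous' (divide by .637), 'one_dichotomous'
    (divide by .798), or 'point' (phi taken as r). The keys are
    labels invented for this sketch.
    """
    divisors = {"both_continuous": 0.637, "one_dichotomous": 0.798, "point": 1.0}
    return phi / divisors[kind]


phi, n_total = 0.30, 100  # hypothetical values
print(round(phi_to_cr(phi, n_total), 3))
print(round(phi_to_chi_square(phi, n_total), 3))
print(round(pearson_from_phi(phi, "both_continuous"), 3))
```

With these illustrative values the critical ratio is 3.0 and chi square is 9.0, the square of the critical ratio, as the interrelationships require.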

Assuming that the sub-groups are equal in size, the formula for 
φ is usually expressed as: 

$$\phi = \frac{p_u - p_l}{\sqrt{pq}} \qquad (4)$$

where $p_u$ = the proportion in terms of total N of one sub-group (upper)
responding in a specified way, 
$p_l$ = the proportion of the other sub-group (lower) responding
in the same way, 
$p$ = the total proportion ($p_u + p_l$) responding in the specified 
way, 
$q$ = the total not responding in the specified way ($1.00 - p$). 

In item analysis the test constructor usually deals with proportions
expressed in terms of each sub-group rather than the total N. 
In such case $p + q = 2.00$. The formula can then be expressed as: 

$$\phi = \frac{\dfrac{p_u - p_l}{2}}{\sqrt{\dfrac{p}{2} \cdot \dfrac{q}{2}}} \qquad (5)$$

where $p = p_u + p_l$ and $q = 2 - (p_u + p_l)$. 
Formula (5) can be expressed entirely in terms of $p_u$ and $p_l$ as: 

$$\phi = \frac{p_u - p_l}{\sqrt{(p_u + p_l)(2 - p_u - p_l)}} \qquad (6)$$

Formula (6) was used in constructing the accompanying table. 
The 200 coefficients on the outer edges of the table were computed 
by formula to be accurate to six decimal places. The 4850 remaining 
coefficients were obtained by successive subtraction of a constant from 
each of the outer-edge entries. The constant results from the fact 
that $p_u + p_l$ remains the same and $p_u - p_l$ decreases systematically 
by .02 on any diagonal from the outer edge which is perpendicular 
to the center diagonal which separates the table into positive and 
negative coefficients. For example: Select a $p_u$ of .71 and $p_l$ of .00, 
which appears on the outer edge of the table. On a diagonal perpendicular
to the center dividing line, $p_u$ and $p_l$ progressively become .70 
and .01, .69 and .02, .68 and .03, .67 and .04, etc. In each case $p_u + p_l$ 
= .71, and $p_u - p_l$ progressively decreases by .02 from .71 to .69, .67, 
.65, .63, etc. The subtracted constant was therefore obtained by multiplying
the reciprocal of the denominator of formula (6) by .02. Successive
subtractions from the outer-edge phi coefficient computed by 
formula thus gave each of the other entries in the diagonal. 
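
The construction just described can be sketched in Python as a check on the reasoning. The function names are hypothetical, proportions are carried as whole hundredths to avoid rounding drift in the indices, and the sketch only reproduces the arithmetic of formula (6) and the successive subtraction; it makes no claim about how the original computations were organized.

```python
from math import sqrt


def phi(p_u, p_l):
    """Formula (6): proportions are taken within each of two equal sub-groups."""
    return (p_u - p_l) / sqrt((p_u + p_l) * (2 - p_u - p_l))


def diagonal(total_hundredths):
    """Positive entries on the diagonal where p_u + p_l is constant.

    The outer-edge entry is computed by formula; each later entry is
    obtained by subtracting the constant .02 / denominator of formula (6).
    Keys are (p_u, p_l) in hundredths.
    """
    s = total_hundredths
    if not 0 < s < 200:
        raise ValueError("p_u + p_l must lie strictly between 0 and 2.00")
    # Start on the outer edge: p_l = .00, or p_u = 1.00 when the sum exceeds 1.00.
    u, l = (s, 0) if s <= 100 else (100, s - 100)
    total = s / 100
    step = 0.02 / sqrt(total * (2 - total))
    value = phi(u / 100, l / 100)
    entries = {}
    while u > l:
        entries[(u, l)] = value
        u, l = u - 1, l + 1
        value -= step
    return entries


# The worked example from the text: p_u + p_l = .71.
d = diagonal(71)
for key in [(71, 0), (70, 1), (69, 2)]:
    print(key, round(d[key] * 1000))
```

For the diagonal on which the two proportions sum to .71, the first two entries come out as 742 and 721 (in thousandths), which agree with the tabled values for rows 71 and 70, and every entry produced by subtraction agrees with direct computation from formula (6).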

Several checks were made to insure accuracy of the table: (1) 
Each of the two hundred outer-edge phi coefficients was separately 
computed by two persons, (2) each of the two hundred subtractive 
constants was computed separately by two persons, (3) original phi 
coefficients and subtractive constants were checked on the basis of 
comparable quadrants of the table, and (4) each row and each col- 
umn of the final table were checked separately by two persons on the 
basis of comparable quadrants. 

The table of phi coefficients is simple to use. The proportion of 
each of two sub-groups responding to an item in a specified manner 
is determined, the proportions being computed on the basis of the 
number of cases within each sub-group rather than the number of 
cases within the total group. The table is entered vertically by one 
of the proportions and horizontally by the other. The phi coefficient 
is given at the point of intersection of the row and column entries. 
For purposes of reducing the size of the table, only the positive co- 
efficients are included. If entry in the usual manner leads to a blank 
cell in the table, the coefficient can be found by reversing the hori- 
zontal and the vertical entry. The manner of entry together with the 
type of data being handled readily indicates whether the coefficient 
is positive or negative. 

The table assumes an equal number of cases in the two sub- 
groups, and if this assumption is met the coefficient is the same as a 
coefficient computed by formula. The greater the difference between 
the N’s of the two sub-groups the greater will the table coefficient dif- 
fer from the computed coefficient. 
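
The effect of unequal sub-groups can be illustrated by comparing the tabled value, which depends only on the two proportions, with the phi coefficient computed from the fourfold frequencies. The Python sketch below uses the standard fourfold formula for φ; the function names and the frequencies are hypothetical and chosen only to show the direction of the discrepancy.

```python
from math import sqrt


def phi_from_table(p_u, p_l):
    """Value the table would give: formula (6), which assumes equal sub-groups."""
    return (p_u - p_l) / sqrt((p_u + p_l) * (2 - p_u - p_l))


def phi_from_counts(a, b, c, d):
    """Fourfold phi from frequencies.

    a, b: upper group responding / not responding in the specified way;
    c, d: lower group responding / not responding.
    """
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


# Equal sub-groups (50 and 50): proportions .80 and .40.
print(round(phi_from_table(0.80, 0.40), 3), round(phi_from_counts(40, 10, 20, 30), 3))

# Unequal sub-groups (20 and 80) with the same proportions .80 and .40.
print(round(phi_from_table(0.80, 0.40), 3), round(phi_from_counts(16, 4, 32, 48), 3))
```

With equal sub-groups the two values coincide at about .408; with sub-groups of 20 and 80 and the same proportions, the computed coefficient falls to about .320 while the table entry is unchanged.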

Knowledge on the part of the user of the number of cases in- 
cluded in the groups will permit a quick determination of critical ra- 
tio, level of significance, or chi square as given by formulas (1), (2), 
and (3). The user can then set his own standards for accepting or 
rejecting items, and for assigning weights to items. 

Inasmuch as this table permits rapid and accurate determination 
of item validity on the basis of statistically rigorous techniques, there 
is no justification for using less efficient methods. Such inefficient 
methods now require as much time as the more acceptable, but hither- 
to time-consuming methods. 
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 


100 1000 990 980 970 961 951 942 932 923 914 905 895 886 877 869 860 851 842 834 825 
99 990 980 970 960 950 941 931 922 912 903 894 884 875 866 857 848 839 831 822 813 
98 980 970 960 950 940 930 921 911 902 892 883 874 864 855 846 837 828 819 810 802 
97 970 960 950 940 930 920 910 901 891 882 872 863 853 844 835 826 817 808 799 790 
96 961 950 940 930 920 910 900 890 881 871 862 852 843 833 824 815 806 797 788 779 
95 951 941 930 920 910 900 890 880 870 861 851 842 832 823 813 804 795 786 777 768 
94 942 931 921 910 900 890 880 870 860 850 841 831 821 812 803 793 784 775 766 756 
93 932 922 911 901 890 880 870 860 850 840 830 821 811 801 792 783 773 764 755 745 
92 923 912 902 891 881 870 860 850 840 830 820 810 801 791 781 772 762 753 744 734 
91 914 903 892 882 871 861 850 840 830 820 810 800 790 781 771 761 752 742 733 724 
90 905 894 883 872 862 851 841 830 820 810 800 790 780 770 761 751 741 732 722 713 


89 895 884 874 863 852 842 831 821 810 800 790 780 770 760 750 741 731 721 712 702 
88 886 875 864 853 843 832 821 811 801 790 780 770 760 750 740 730 721 711 701 692 
87 877 866 855 844 833 823 812 801 791 781 770 760 750 740 730 720 710 701 691 681 
86 869 857 846 835 824 813 803 792 781 771 761 750 740 730 720 710 700 690 681 671 
85 860 848 837 826 815 804 793 783 772 761 751 741 730 720 710 700 690 680 670 661 
84 851 839 828 817 806 795 784 773 762 752 741 731 721 710 700 690 680 670 660 650 
83 842 831 819 808 797 786 775 764 753 742 732 721 711 701 690 680 670 660 650 640 
82 834 822 810 799 788 777 766 755 744 733 722 712 701 691 681 670 660 650 640 630 
81 825 813 802 790 779 768 756 745 734 724 713 702 692 681 671 661 650 640 630 620 
80 816 805 793 781 770 759 747 736 725 714 704 693 682 672 661 651 641 630 620 610 


79 808 796 784 773 761 750 738 727 716 705 694 683 673 662 652 641 631 620 610 600 
78 800 788 776 764 752 741 729 718 707 696 685 674 663 653 642 632 621 611 600 590 
77 791 779 767 755 744 732 720 709 698 687 676 665 654 643 633 622 612 601 591 580 
76 783 771 759 747 735 723 712 700 689 678 667 656 645 634 623 612 602 591 581 571 
75 775 762 750 738 726 714 703 691 680 669 657 646 635 625 614 603 592 582 571 561 
74 766 754 742 730 718 706 694 682 671 660 648 637 626 615 604 594 583 572 562 551 
73 758 746 733 721 709 697 685 674 662 651 639 628 617 606 595 584 573 563 552 542 
72 750 737 725 713 700 688 677 665 653 642 630 619 608 597 586 575 564 553 543 532 
71 742 729 717 704 692 680 668 656 644 633 621 610 599 588 577 566 555 544 533 523 
70 734 721 708 696 684 671 659 647 636 624 612 601 590 578 567 556 545 535 524 513 


69 726 713 700 688 675 663 651 639 627 615 603 592 581 569 558 547 536 525 514 504 
68 718 705 692 679 667 654 642 630 618 606 595 583 572 560 549 538 527 516 505 494 
67 710 697 684 671 658 646 634 621 609 597 586 574 563 551 540 529 518 507 496 485 
66 702 689 676 663 650 637 625 613 601 589 577 565 554 542 531 519 508 497 486 475 
65 694 681 667 654 642 629 616 604 592 580 568 556 545 533 522 510 499 488 477 466 
64 686 673 659 646 633 621 608 596 583 571 559 547 536 524 513 501 490 479 468 457 
63 678 665 651 638 625 612 600 587 575 563 550 539 527 515 503 492 481 469 458 447 
62 670 657 643 630 617 604 591 579 566 554 542 530 518 506 494 483 472 460 449 438 
61 662 649 635 622 608 595 583 570 557 545 533 521 509 497 485 474 462 451 440 429 
60 655 641 627 614 600 587 574 561 549 536 524 512 500 488 476 465 453 442 431 419 


59 647 633 619 605 592 579 566 553 540 528 515 503 491 479 467 456 444 433 421 410 
58 639 625 611 597 584 570 557 544 532 519 507 494 482 470 458 447 435 423 412 401 
57 631 617 603 589 576 562 549 536 523 510 498 486 473 461 449 438 426 414 403 391 
56 624 609 595 581 567 554 541 527 514 502 489 477 464 452 440 428 417 405 394 382 
55 616 601 587 573 559 546 532 519 506 493 480 468 456 443 431 419 408 396 384 373 
54 608 593 579 565 551 537 524 510 497 484 472 459 447 434 422 410 398 387 375 364 
53 600 586 571 557 543 529 515 502 489 476 463 450 438 425 413 401 389 377 366 354 
52 593 578 563 549 535 521 507 493 480 467 454 441 429 416 404 392 380 368 356 345 
51 585 570 555 541 526 512 498 485 471 458 445 432 420 407 395 383 371 359 347 335 
50 577 562 547 532 518 504 490 476 463 450 436 424 411 398 386 374 362 350 338 326 


Under controlled flight conditions, the distance between a 
navigator’s report of position and his actual position is a criterion 
of success in dead reckoning navigation. Students’ logs were eval- 
uated for five separate missions by comparing the students’ entries 
with standards determined by experts. The reliability of this tech- 
nique is indicated by the fact that mission to mission intercorre- 
lations of error scores were low, while the intercorrelations be- 
tween legs of the same mission were moderately high. The inter- 
correlations between the error scores for the different navigation 
variables were computed and analyzed by using both factor analysis 
and multiple regression techniques. Both analyses indicated that 
a major portion of all dead reckoning error could be attributed to 
errors made in determining magnetic deviation. As a result of 
these analyses, recommendations were made for changing the in- 
struction in dead reckoning and alterations in the equipment used 
were suggested. 


1. The Problem 


Since one of the most difficult tasks in psychology is the objective 
evaluation of complex skills, one of the most difficult tasks of the 
aviation psychologist is the objective evaluation of flight performance. 
It is the purpose of this paper to discuss a method for evaluating navi- 
gators’ dead reckoning proficiency as an example of the difficulties 
encountered in assessing aerial performance and also to describe a 
method by which a successful evaluative technique may be used to 
analyze performance critically and to lead to its improvement. 

At first glance it would seem that the evaluation of dead reckon- 
ing would be simple and straightforward. The navigator’s plane de- 
parts from a known position; by determining the true air speed, the 
wind, and the true course flown, the navigator calculates his position 
for a particular time without recourse to check points on the ground. 
By comparing the navigator’s calculated position with his actual po- 
sition, it is possible to obtain an immediate, objective measure of the 
accuracy of his work. In actual practice this can reasonably be done 
for the navigators in any one plane, but as soon as it is desired to 
compare the performance of navigators from plane to plane, it is 
found that the results are not comparable due to a multitude of uncontrolled
flight conditions. Even though two planes depart from the 
same point for the same destination, it is not proper to evaluate the 
navigators’ relative skill on the basis of their objective results in terms 
of position error unless most careful controls have been instituted to 
assure similar flight conditions for both planes. The development of 
an objective scale for evaluating navigation skill involved setting up 
conditions in which all the navigators would be exposed to the same 
situation in the air, perform the same kind of navigation, and have 
their performances evaluated in terms of a precise standard and in 
an objective and equitable manner. At the same time, the technique 
used had to insure that the missions were flown safely, represented an 
efficient expenditure of plane time, and were consistent with the other 
operational problems encountered at a training school. In addition, an 
adequate technique had to be reliable and reproducible. 

* This work was done as a part of the AAF Aviation Psychology Program. 
The authors are indebted to Thomas Paltier, Harley Smith, Wolcott, Lyon, and 
John King for assistance with the calculations. 


2. Description of the Technique 
The technique employed was to fly the navigation missions in 
formation and to require the navigators to perform “follow-the-pilot” 
navigation, that is, dead reckoning navigation in which the navigator 
simply determines the position of the plane at specified times but does 
not direct the pilot regarding the course he is to fly. By flying forma- 
tions in which the planes were within a few yards of each other, all 
the students were flown over the same course, at the same speed, and 
under the same weather conditions. Several different types of formations
and planes were used during this experiment. 


Each mission consisted of four legs with each leg covering ap- 
proximately 100 miles and terminating at selected turning points. 
At each turning point the formation changed course by approximately 
90°. The turns were made consistently in one direction so that a some- 
what square track was followed. Flying the missions over a square 
track yielded three advantages: (1) it was possible to return to the 
place of departure and thus secure maximal use of student and plane 
time; (2) by turning 90° at each turning point, the students were pro- 
vided with an opportunity to obtain the best estimate of wind on two 
headings; and (3) four legs of 100 miles each made possible the collec- 
tion of adequate, comparable, and independent samples of navigation 
on each leg. At the same time that the students were required to deter- 
mine their position at each of the four turning points by dead reckon- 
ing, expert lead navigators in the first plane of the formation were di- 
recting the course flown and keeping a precise record of the actual 
track made good over the ground. While the student navigators had to 
rely only on their calculations, the lead navigators actually looked at the 
ground and carefully determined where they were at all times. Other 
personnel in the planes made very accurate readings of the air speed 
meter, the compass, the drift meter, and the astro-compass. In this 
way extremely accurate standards were developed with which the 
students’ log entries could be compared. Since the students all navi- 
gated under the same flight conditions and objective standards were 
available for evaluating their performance, it was believed that an 
adequate technique had been developed. 


3. The Reliability of the Technique 

In October and November of 1944, approximately 180 students 
were flown on three different flight missions, and again during the 
summer of 1945, 80 students were flown on four different flight mis- 
sions spaced throughout the navigation course. From the data collected 
on these missions, it was possible to analyze the reliability of the 
technique. Two types of analysis were possible, namely, an intra-mis- 
sion reliability which corresponds to a split-half reliability and an in- 
ter-mission reliability which corresponds to a test-retest reliability. 

Since the missions were flown by legs, there were four indepen- 
dent evaluations of each navigator’s performance. The correlation be- 
tween the errors made on the first and fourth legs and the errors 
made on the second and third legs were computed for a number of 
navigation variables; however, only the reliabilities for true air speed 
will be discussed at this point since the conclusions drawn from the 
true air speed reliabilities are similar to those drawn from the remainder
of the data. Table 1 shows the correlation between the error 
scores of the first and fourth legs and those of the second and third 
legs for the first group of students. 


TABLE 1 
Within Missions Reliability Coefficients for True Air Speed 
(Legs 1-4 With Legs 2-3) 

Flight   Formation   Mission 5     Mission 6     Mission 7     All Missions
                     N    Rho      N    Rho      N    Rho      Combined r
23A      1           28   .88      23   .56      22   .76      .78
         2           19   .25      18   .75      [?]  .86      .58
23B      1           23   .59      23   .84      20   .21      .48
         2           20   .25      20   .12      20   .61      .57
24A      1           21   .51      21   .68      24   .75      .68
         2           17   .94      17   .57      21   [?]      .70
24B      1           21   .56      21   .64      24   .68      .63
         2           17   .45      18   [?]      20   .48      .42
Combined r: .61, .61 [remaining entries illegible in source]

Since the error scores obtained were not normally distributed 
and it was felt that extreme scores should not bias the reliabilities, 
rank order correlations rather than Pearsonian correlations were cal- 
culated. However, the combined correlations were obtained by con- 
verting the rho’s to r’s and combining them by the z transformation 
and reconverting to r’s. From Table 1 it will be seen that while there 
is considerable variability from formation to formation, the reliabili- 
ties are fairly high, the Spearman-Brown correction giving a relia- 
bility of .77. But from examining the data, it was suspected that this 
reliability was unduly high due to constant factors within particular 
planes; thus, if the air speed meter in one plane were improperly 
calibrated, all of the students within that plane would make high 
error scores which would not be indicative of their true ability. If 
there were no systematic factors within planes, it would be expected 
that the correlation between the error scores of students within the 
same planes would be 0. Table 2 shows the correlation of the error 
scores of students in the same planes by seats. In view of the magni- 
tude of these correlations, it was thought probable that systematic 
plane differences accounted for some of the reliability shown in Table 1.
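The pooling and step-up steps described above can be sketched in Python. This is only a minimal sketch under stated assumptions: the text does not say how the rho's were converted to r's or how the z values were weighted, so the conversion r = 2 sin(πρ/6) and the N − 3 weights used here are assumptions, and the function names are illustrative. The split-half value of .63 passed to the Spearman-Brown step is likewise an illustrative figure, chosen because it steps up to roughly the .77 reported.

```python
import math

def rho_to_r(rho):
    """Convert a rank-order rho to a product-moment r.

    Assumed conversion (not stated in the text): r = 2 sin(pi * rho / 6).
    """
    return 2.0 * math.sin(math.pi * rho / 6.0)

def combine_correlations(pairs):
    """Pool (n, r) pairs through Fisher's z, weighting each z by n - 3 (an assumption)."""
    total_weight = sum(n - 3 for n, _ in pairs)
    mean_z = sum((n - 3) * math.atanh(r) for n, r in pairs) / total_weight
    return math.tanh(mean_z)

def spearman_brown(r_half):
    """Step a split-half correlation up to the reliability of the full-length measure."""
    return 2.0 * r_half / (1.0 + r_half)

# Mission 5 entries for flight 23B in Table 1: (N, rho) for formations 1 and 2.
pooled = combine_correlations([(23, rho_to_r(0.59)), (20, rho_to_r(0.25))])
stepped_up = spearman_brown(0.63)  # illustrative split-half value
print(round(pooled, 2), round(stepped_up, 2))
```

Averaging in the z scale rather than averaging the coefficients directly keeps large correlations from being underweighted, which is presumably why the z transformation was used.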


TABLE 2
Correlation Coefficients Between Seats for True Air Speed on Mission 5

Flight        Seats 1 and 2     Seats 2 and 3     Seats 1 and 3
              N     Rho         N     Rho         N     Rho
23A           14    .36         13    .54         13    .22
23B           14    .20         13    −.12        14    −.11
24A           10    .10         10    .49         11    .34
24B           11    .21         12    .24         10    .65
Combined r          .25               .30               .28





These results indicated that a more appropriate reliability might be 
obtained by analysis of covariance. After the variances associated with 
plane and seat were removed, a coefficient of .48 was obtained for the 
intra-mission reliability of the individual’s scores. 

Another estimate of the technique’s reliability was obtained by 
comparing performance on successive missions. The correlations between
error scores on different missions were computed and the results are
shown in Table 3.














TABLE 3
Between Missions Correlation Coefficients for True Air Speed

Flight  Formation   Mission 5 vs. 6   Mission 6 vs. 7    Mission 5 vs. 7
                    N     Rho         N     Rho          N     Rho
23A     1           23    −.13        22    .04          22    −.08
        2           18    .03         17    −.27         17    −.16
23B     1           23    −.08        20    −.35         20    −.06
        2           20    −.05        20    .00          20    .28
24A     1           19    −.61        21    [illeg.]     21    .14
        2           15    .17         17    .13          17    −.19
24B     1           18    .45         21    −.10         21    .01
        2           15    −.34        18    .08          17    −.11
Combined r                −.01              .00                .00





This complete lack of relationship raises serious question regarding 
the adequacy of the technique. However, since the experiments took 
place at a time when the learning curve was steep, the test-retest relia- 
bility coefficient may have been somewhat attenuated. There were 
probably other uncontrolled variables, such as day-to-day fluctuations 
in the planes, the relative skill of pilots, and variable weather con- 
ditions. It is, of course, also possible that the technique of measure- 
ment is sufficiently reliable, but that the navigation task is so diffi- 
cult and so influenced by chance factors that consistent test-retest re- 
sults could not be expected except for a great number of missions. If 
this is the case, it raises serious doubt as to the possibility of ever 
practically achieving an adequate objective evaluation of navigators’ 
performance. It should be remembered that each student navigated 
over twelve hours in furnishing the data on which these reliabilities 
are based. 

Since these reliabilities are based on data collected from the first 
application of the technique and since somewhat unstable AT-7 planes 
seating only three students were used, it was decided to apply the 
technique to a second sample of 80 cadets at Ellington Field where 
larger and more stable C-47’s, seating eight navigators, were available. 
The Ellington Field students were flown on missions in their seventh, 
eleventh, sixteenth, and twentieth weeks of training. Table 4 shows the 
correlation between successive missions and between the total errors 








36 PSYCHOMETRIKA 


on missions 1 and 4 against those on missions 2 and 3. In addition to
correlations for true air speed, the correlations for deviation, drift,
and distance-off are shown.


TABLE 4
Between Missions Correlation Coefficients from the
Second Series of Missions

Variable          Missions    Missions    Missions    Missions
                  1 and 2     2 and 3     3 and 4     1 + 4 and 2 + 3
Drift             [illeg.]    [illeg.]    −.01        .19
Deviation         [illeg.]    .13         −.03        .09
True Air Speed    .27         .18         .13         .27
Distance-Off      −.10        .13         .01         −.03





Certainly, the intercorrelations in this table are not high enough to 
suggest that this technique, even though internally reliable, will show 
consistent positive correlations when administered from time to time. 

Similar correlations have been reported for other measures of 
complex flying skill. The following data have been reported by Psycho- 
logical Research Project (Pilot), Randolph Field, Texas, in their Re- 
port for the Fiscal Year 1945. Table 5 shows the reliability for data 
collected in measuring the performance of elementary pilot students
in landing.


TABLE 5

Reliability of Measures of Elementary Pilot
Students’ Landing Ability

                      Ground vs.            1st Landing vs.      1st Day Landings vs.
                      Air Observer          2nd Landing on       2nd Day Landings
Measures              Trial 1    Trial 2    Day 1     Day 2      Landing 1   Landing 2
Zone in which
  Plane Lands         [illeg.]*  .86        .14†      .11        .02†        .01
Landing Attitude      .87        .68        .45       .18        −.07        .15
Bounced or
  Dropped             .89        .87        .44       .19        −.01        .01

* The N for these six correlations equals 152.
† The N for these six correlations equals 170.


Again it will be noted that the reliability of ratings made for any
particular mission is high while the reliability of ratings made for
different missions on different days tends to be low. Without offering
further data, it may be ventured that most measures of flight
performance for the navigator, pilot, gunner, and bombardier will tend
to show low inter-mission correlations even though intra-mission
reliabilities may be fairly high.


4. Use of the Technique in Critically Analyzing Performance 

Whenever it is possible to measure the component parts of any 
performance, an objective assessment of the relative importance of 
the several kinds of operator errors may be undertaken. On logical 
grounds it was possible to attribute poor performance to a number of 
different types of error in navigation; but it was not possible to deter- 
mine objectively the relative importance of these errors until some 
technique had been developed to measure the accuracy both of over-all 
performance and of each of the steps of navigation. The technique pre- 
viously described made it possible to determine the particular opera- 
tions responsible for most of the navigator’s error. By correlating the 
error scores for each navigation operation, a correlation matrix for 
the error scores was constructed. In analyzing this matrix, either by 
factor analysis techniques or by multiple regression techniques, it was 
possible to determine the major causes of navigation error and their
relative importance. Tables 6 and 7 show the correlations between the
error scores for the different navigation variables for three sets of
data.


TABLE 6
Intercorrelations Between Errors for 8 Navigation Variables for 165 Students
Who Were in Their 7th Week of Training at Selman Field

                 Track     Drift  DEV       TAS    WF        WD     GS     DO    | MEAN     SD
Track                      .11    [illeg.]  −.03   .12       .17    .06    .78   |  18.40    8.04
Drift            .11              .01       .00    .72       .27    .52    .12   |   5.10    8.35
Deviation        [illeg.]  .01              −.02   [illeg.]  .17    .11    .67   |  18.90    8.20
True Air Speed   −.03      .00    −.02             .04       −.18   .28    .12   |   8.20    6.65
Wind Force       .12       .72    [illeg.]  .04              .12    .57    .18   |  10.30    8.10
Wind Direction   .17       .27    .17       −.18   .12              .27    .18   | 135.00   87.00
Ground Speed     .06       .62    .11       .28    .57       .27           .30   |  19.20    9.76
Distance-Off     .78       .12    .67       .12    .18       .18    .30          |  31.40   14.80





It should be noted that Table 7 shows two sets of data, one set above 
the diagonal and the other below. Table 8 shows the rotated factor 
loadings and communalities for the three sets of correlations shown 
in Tables 6 and 7. 
These factor loadings were obtained by the centroid method of factor 
analysis. In these analyses the variable, distance-off, may be con- 
sidered as a criterion since the types of errors are being sought which 
contribute most to over-all inaccuracy in dead reckoning navigation. 
In each analysis in Table 8, the first factor contains the heaviest dis- 
tance-off loading and thus the variables defining this factor will be 
those which are responsible for the largest part of dead reckoning 
error. It will be noted that deviation has a high loading in the first 
factor in each of the analyses. Since deviation and drift are the two in- 
dependent variables making up track, it is apparent that errors in 
deviation are the major cause of error in dead reckoning navi- 
gation. (Drift and deviation are the two independent variables which 
when applied to compass heading determine track; errors in the other 
“heading variables” such as magnetic heading, true heading, and track 
are directly attributable to errors in deviation or drift.) 

The second factor in each analysis has a high loading in drift and 
also a high loading in one of the wind variables; but it has only mod- 
erate or low factor loadings on the criterion, distance-off. Similarly, the 
third factor has its highest loadings in one of the speed variables, 
ground speed or air speed; and this factor also has only moderate or 
low loadings in the criterion. The fourth factor, in the cases when it 
has been extracted, may be a specific factor associated with the par- 
ticular mission involved. 

From this analysis two important points stand out. First, in all 
the analyses the first factor is most clearly attributable to error in 
the determination of deviation and it is also the most important
factor in accounting for navigation inaccuracies. The second point is
that in each analysis three factors are clearly identifiable, each
marked by the same loadings from analysis to analysis.

It may be suggested that in each analysis there are certain load- 
ings for one of the factors which do not appear in other analyses. This 
is only to be expected. The psychologist should not think of these mis- 
sions as similar to carefully controlled testing situations since, as has 
been previously mentioned, the actual testing situation changed from 
mission to mission. The wind forces were considerably different on 
each mission; the type of plane and the type of formation used at
Selman Field differed from those used at Ellington Field. Thus it is
surprising that the results are as consistent as they are. Another reason
for these fluctuations is the small number of cases in the two Elling- 
ton Field groups. They were not analyzed together since for the first 
flight the average wind force was between 10 and 15 knots while for 
the second flight it was between 20 and 25 knots. The influence of the 
number of cases may be noticed in the intercorrelations and factor 
loadings for drift, wind force, and wind direction. In the first Elling- 
ton Field analysis three students obtained reciprocal winds by plotting 
negative drifts when they should have been positive and vice versa. 
The result was that there were no errors in wind force to correspond 
to the large discrepancies between actual drift readings and the stu- 
dents’ estimates, but the error in wind direction was at a maximum. 
Thus, the correlation between errors for wind force and drift was low, 
while the correlation between errors for wind direction and drift was 
high. In spite of these errors, the over-all factorial composition
is quite consistent.

To check and extend the above results, the correlation matrices 
presented in Tables 6 and 7 and the results from three later missions 
were analyzed by the use of multiple regression techniques. The
variable, distance-off, was considered the dependent variable and beta
weights were determined for the other variables. Thus the highest beta
weights would indicate those variables whose errors contributed the 
most to distance-off in dead reckoning navigation. Table 9 shows the 
beta weights and multiple correlations obtained when distance-off is 
predicted by seven navigation variables, and also when it is predicted 
by the three completely independent and basic variables. 
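The beta weights can be illustrated from the correlations alone. The sketch below is a minimal illustration, not the authors' computing routine, which the text does not describe: it assumes the standard relation that the standardized weights solve R_xx β = r_xy and that the squared multiple correlation is the sum of β times r_xy. The helper names are hypothetical, and the correlations are the drift, deviation, and true air speed entries as read from Table 6.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(n):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [x - factor * y for x, y in zip(m[row], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def beta_weights(r_xx, r_xy):
    """Standardized regression weights and multiple R from correlations alone."""
    betas = solve_linear(r_xx, r_xy)
    multiple_r = math.sqrt(sum(b * r for b, r in zip(betas, r_xy)))
    return betas, multiple_r

# Predictors: drift, deviation, true air speed (values as read from Table 6).
r_xx = [[1.00, 0.01, 0.00],
        [0.01, 1.00, -0.02],
        [0.00, -0.02, 1.00]]
r_xy = [0.12, 0.67, 0.12]  # correlation of each predictor with distance-off
betas, multiple_r = beta_weights(r_xx, r_xy)
print([round(b, 2) for b in betas], round(multiple_r, 2))
```

Because the three basic variables are nearly uncorrelated with one another, each weight comes out close to the variable's own correlation with distance-off, so deviation dominates the prediction.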


TABLE 9
Beta Weights and Multiple Correlations Obtained for Predicting Distance-Off
from Seven Navigational Variables and from the Three
Basic Navigational Variables

Group                     Selman     Ellington
Week in Training          7th        7th        11th       16th    20th

Beta weights, seven variables
  Track                   .67        .53        .84        .17     .61
  Drift                   .00        .09        .00        .02     .00
  Deviation               .13        .28        .00        .20     .05
  True Air Speed          .08        .00        .00        .24     .07
  Wind Force              .00        .00        .00        .16     .00
  Wind Direction          .00        .00        .16        .05     .07
  Ground Speed            .23        [illeg.]   [illeg.]   .24     .39
  Multiple R              .83        .87        [illeg.]   .99     .85

Beta weights, three basic variables
  Drift                   [illeg.]   .25        .15        .36     .09
  Deviation               .67        .71        .60        .76     .41
  True Air Speed          .13        .01        −.02       .10     .29
  Multiple R              .69        .17        .68        .84     .49
















It will be seen that the results of the factor analyses are confirmed and 
that deviation is again the most important of the variables
(remembering that errors in track are a function of errors in deviation).
The relative importance of errors in true air speed and in drift seems to be
about the same, but both are considerably less important than errors 
in deviation. 

On the basis of these results, it was possible to recommend to the 
personnel responsible for navigation training that the amount of in- 
struction and practice on the different techniques for determining 
deviation be increased. The results also indicated that the instruments 
used for determining deviation might be faulty and the consideration 
of these instruments led to the recommendation of several changes in 
the astro-compass to improve the accuracy with which it could be used. 
It will be noted that in studying a complex skill, it is first necessary 
that a technique be developed for assessing over-all performance of 
the skill and also for assessing the separate components determining 
this performance. Once these measures have been developed, it is 
possible to determine statistically the importance of the various parts 
of the task. Such an analysis can then be used as a basis for recommend- 
ing changes in courses of instruction or as a point of departure for 
examining the different parts of the task to determine those aspects 
of performance which can be most profitably improved. 

5. Summary 

A technique for assessing dead reckoning navigation performance 
was developed. The technique consisted of flying a large number of 
students simultaneously in formation over the same course. The re- 
liability of this technique was found to be fairly high within any one 
mission, but to be practically zero when measured by the correlation 
between missions. Considering these reliabilities and others collected 
in measuring pilots’ landing ability, it is concluded that in many
complex skills reliability for any particular trial may be high and yet the
correlation between trials, which corresponds to test-retest reliability, 
may be low. 

Once the technique for assessing performance had been developed, 
it was possible to determine the principal cause of error in dead 
reckoning performance by the application of factor analysis and mul- 
tiple regression techniques. On the basis of these results, it was pos- 
sible to make recommendations, based on objective evidence, regarding 
changes in instruction and improvement of the instruments used in
navigation.
ANALYSIS IN TERMS OF FREQUENCIES OF DIFFERENCES 


HAROLD A. VOSS 
THE PROCTER AND GAMBLE COMPANY 


A technique of analysis utilizing frequencies of differences is 
described and applied to a hypothetical experiment involving two 
methods of instruction. A nomograph is provided for computing the 
chi-square values applicable to the method. 


On several occasions, the author has had the problem of compar- 
ing the effectiveness of two training methods, or aids, where a num- 
ber of experimental conditions were involved. In such case, the inves- 
tigator may employ analysis of variance or the conventional tech- 
niques for evaluating the reliability of differences. However, either 
time pressure or the preliminary nature of the investigation may 
make it advisable to short cut the lengthy computations involved in 
these methods. The method to be described is believed to fulfill the 
need for such a short cut. 

Briefly, the method involves tabulation of the paired measures 
under the varied conditions of the experiment. The measures may 
be scores of individuals, means, standard deviations, percentages, or 
correlation coefficients, depending on the measuring instrument and 
other circumstances of the experiment. A second table is then pre- 
pared from the first, the entries being the frequency of differences 
for each condition favoring the one method over the other. For each 
experimental condition $\chi^2$ is then computed to determine if the
distribution of differences differs significantly from the expected
distribution. If the two methods were equally effective, the expected
distribution of differences would center about zero with an equal 
number of differences favoring each method. 

The hypothetical data in Table 1 have been prepared to illustrate 
the method. Two methods of instruction in puzzle solving are com- 
pared, demonstration alone and demonstration plus explanation. Six 
groups, each consisting of a subgroup instructed by one method and 
a subgroup instructed by the other method, have been selected in 
random fashion from a population of school children relatively homo- 
geneous in age, intelligence, and school grades. For each group 
the schedule of instruction and test is the same. On Days 1 and 2 a 
period of instruction is followed by a test on three puzzles and two 
weeks later, on Day 15, there is another test on three puzzles but no 
instruction. Use of rewards and differences in puzzle content are 
varied systematically over the groups. To summarize, puzzle solv- 
ing under two types of instruction, demonstration and demonstration 
plus explanation, is compared over the following conditions: 


Motivation — reward used 
reward not used 
Test — after first instruction period 
after second instruction period 
two weeks after second instruction period 
Puzzle content — numbers 
geometric patterns 
symbols other than numbers 
Puzzle type — one step 
two step 
three step 


Each pair of entries in Table 1 is marked with a plus (+) if 
the difference favors the demonstration plus explanation group. Zero 
(0) indicates no difference, and minus (—) indicates a difference in 
favor of the demonstration group. Table 2 presents the tabulation 
of the frequency of + and — differences for each experimental con- 
dition. Zero differences give a credit of .5 to each method. For each 
subcategory of the experimental conditions, chi square has been com- 
puted to determine whether the observed distribution of differences 
departs significantly from the expected distribution. As was pointed 
out previously, if neither method is superior, an equal number of 
differences can be expected to favor each method in accordance with 
the familiar null hypothesis. 

In the chi square formula,

$$\chi^2 = \sum \left[ (f_o - f_t)^2 / f_t \right],$$

where

$f_o$ = observed frequency; the values used in the present case are
the frequency of plus and of minus differences in turn;

$f_t$ = expected frequency; the value used is $n/2$.
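The computation for a single experimental condition can be sketched in a few lines of Python. This is a minimal illustration of the formula above, with an illustrative function name; it assumes that zero differences have already been split, .5 to each method, before the counts are passed in.

```python
def chi_square_of_differences(plus, minus):
    """Chi square (1 df) for counts of differences favoring each method.

    Zero differences are assumed to have been split already, .5 to each side.
    """
    n = plus + minus
    expected = n / 2.0  # f_t under the null hypothesis of no difference
    return sum((observed - expected) ** 2 / expected for observed in (plus, minus))

# Reward condition in Table 2: 22 plus and 5 minus differences out of 27.
print(round(chi_square_of_differences(22.0, 5.0), 2))
```

Since only two counts enter the formula, each condition requires a single subtraction, a squaring, and a division, which is the source of the time saving claimed for the method.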


The analysis in Table 2 shows that demonstration plus explanation
is a more effective method of instruction in puzzle solving than
demonstration alone. The $\chi^2$ and corresponding P values provide the
investigator with sufficient material to draw conclusions about the
differential effects of the two methods of instruction under the varied
conditions of motivation, recall and retention, puzzle content, and
puzzle type. Since the purpose here is to describe a method of analysis
and not to draw conclusions from hypothetical data, the reader will be
spared the discussion outlined. However, it should be noted in passing
that the results of the analysis appear in a concise, understandable
form.


TABLE 2

Analysis in Terms of Frequencies of Differences Favoring One
Method of Instruction over the Other
(Based on hypothetical data from Table 1)

Experimental         n     f_o (+)      f_o (−)    f_t      χ²*       P
condition                  Dem + Exp    Dem        (n/2)    df = 1
Motivation
  Reward             27    22.0          5.0       13.5     10.70     <.01
  No reward          27    17.0         10.0       13.5      1.81     .20−.10
Test
  Day 1              18    11.5          6.5        9.0      1.39     .30−.20
  Day 2              18    14.0          4.0        9.0      5.56     .02−.01
  Day 15             18    13.5          4.5        9.0      4.50     .05−.02
Puzzle content
  Numbers            18     8.0         10.0        9.0       .22     .70−.50
  Patterns           18    15.0          3.0        9.0      8.00     <.01
  Symbols            18    16.0          2.0        9.0     10.89     <.01
Puzzle type
  1 step             18    10.5          7.5        9.0       .50     .50−.30
  2 step             18    14.0          4.0        9.0      5.56     .02−.01
  3 step             18    14.5          3.5        9.0      6.72     <.01
Total experiment     54    39.0         15.0       27.0     10.67     <.01

* $\chi^2 = \sum [(f_o - f_t)^2 / f_t]$.

Table 3 presents an analysis of the same data in terms of the
reliability of mean differences employing the conventional t technique.
Table 3 strongly substantiates Table 2 with the greatest difference
being the somewhat higher confidence levels reflected in the P values
of Table 3. A comparison in terms of the usual interpretation of P
values (.01 highly significant; .05 significant; > .05 not significant)
indicates that in six cases out of twelve the interpretation would be
the same, in five cases there would be a one-step difference, and in
one case a two-step difference. The similarity of the two methods is
further indicated by the rank correlation coefficient between
chi-square and t of .91. If the square of this value may be taken to
indicate the amount of overlap in the two methods of analysis, then
the frequency of difference analysis is approximately eighty per cent
as effective as the t test. It would be pertinent to note at this point
that the hypothetical data were set up systematically and were not
juggled in any way later to produce greater correspondence.


TABLE 3

Analysis in Terms of Mean Differences between Two Methods of Instruction
(Based on hypothetical data from Table 1)

Experimental         M_1          M_2      M_d     σ_d     σ_Md    t*      df    P
condition            Dem + Exp    Dem
Motivation
  Reward             29.10        26.91    2.19    2.45    .48     4.56    26    <.01
  No reward          29.76        28.75    1.01    1.77    .35     2.89    26    <.01
Test
  Day 1              31.53        30.27    1.26    2.35    .57     2.21    17    .05−.02
  Day 2              28.42        26.34    2.08    2.33    .57     3.65    17    <.01
  Day 15             28.32        26.88    1.44    1.86    .45     3.20    17    <.01
Puzzle content
  Numbers            25.51        25.86    −.15    1.59    .39     −.38    17    .80−.70
  Patterns           28.82        26.27    2.55    2.38    .58     4.40    17    <.01
  Symbols            33.94        31.87    2.07    1.83    .44     4.70    17    <.01
Puzzle type
  1 step             18.21        17.72     .49     .94    .23     2.18    17    .05−.02
  2 step             25.46        24.26    1.20    1.69    .41     2.93    17    <.01
  3 step             44.62        41.51    3.11    2.71    .66     4.71    17    <.01
Total experiment     29.48        27.88    1.60    2.22    .30     5.33    53    <.01

* $t = \dfrac{M_1 - M_2}{\sigma_d / \sqrt{N - 1}}$
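For comparison with the frequency method, the t computation of Table 3 can be sketched as follows. This is a minimal illustration of the formula given under that table; the function name is hypothetical, and N is assumed to be the number of paired differences in the condition (27 for the reward condition, as in Table 2).

```python
import math

def t_from_differences(mean_1, mean_2, sd_diff, n):
    """t = (M1 - M2) / (sigma_d / sqrt(N - 1)), the formula given under Table 3."""
    standard_error = sd_diff / math.sqrt(n - 1)
    return (mean_1 - mean_2) / standard_error

# Reward condition in Table 3: M1 = 29.10, M2 = 26.91, sigma_d = 2.45, N = 27.
t_reward = t_from_differences(29.10, 26.91, 2.45, 27)
overlap = 0.91 ** 2  # squared rank correlation between chi square and t
print(round(t_reward, 2), round(overlap, 2))
```

The t computation needs the means and the standard deviation of the differences for every condition, whereas the frequency method needs only a count of signs.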

Whatever sacrifice in precision is entailed, the saving in com- 
puting time is considerable. The frequency of difference analysis 
reported here was done in less than thirty minutes without the aid of a 
calculating machine, whereas the mean difference analysis required 
about three hours with a calculator. As regards the alternate technique 
of analysis of variance, one thing is certain — it would have taken far 
more time. An even greater saving of time may be accomplished with 
the aid of the nomograph here provided. It is designed to determine the 
value of chi square when there is one degree of freedom and $f_t = n/2$. 
The nomograph includes a table of P values corresponding to chi 
square. 

The reader will readily visualize additional applications of the 
frequency of differences method. It is applicable wherever comparative 
measures of effectiveness are available for methods, instruments, aids, 
chemical compounds, etc. As was noted previously, the measures of 
effectiveness may be scores on tests or other measuring instruments, 
time scores, or statistics based upon individual scores. Some experi- 
mental applications may take advantage of the fact that the technique 
takes into account only the frequency and direction of the differences 
and disregards the magnitude of the differences. 

Whatever precision is lost is due to this disregard of magnitude. 
However, this disadvantage is offset by a number of advantages which 
may be listed as follows: 


1. The method is extremely rapid, involving little com- 
putation and hence little opportunity for error. 

2. It has a wide range of applications, particularly in 
the field of research on training methods and aids. 

3. It yields results which are readily understandable
and can be explained to those unfamiliar with
statistical methods.

4. It involves no terminology or concepts outside the
realm of conventional statistics.
$$\chi^2 = \sum \left[ (f_o - f_t)^2 / f_t \right]$$

where
$f_o$ = frequency observed;
$f_t$ = frequency expected, in this case $f_t = N/2$.


DIRECTIONS. Divide the larger value of $f_o$ by N.
Find this value on the $p_o$ scale and the value of N
on the N scale. Connect the two values with a ruler
and read the value of $\chi^2$ and the corresponding
value of P where the ruler cuts the scales so marked.


EXAMPLE. $f_o$ = 8 and 2, n = 10. Divide 8 by 10 and
connect the resulting value, .8, on the $p_o$ scale with
10 on the N scale. The ruler cuts the $\chi^2$ scale at
3.6 and the P scale at .06 approximately.


NOTE. The nomograph may be used to determine chi-square
only when there is one degree of freedom (two cells)
and $f_t = n/2$.
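What the nomograph reads off can also be computed directly. The sketch below is an algebraic restatement rather than anything given in the text: with two cells and $f_t = N/2$, the chi square formula reduces to $N(2p_o - 1)^2$, where $p_o$ is the larger $f_o$ divided by N, and for one degree of freedom the upper-tail probability equals the complementary error function of $\sqrt{\chi^2/2}$. The function names are illustrative.

```python
import math

def nomograph_chi_square(larger_fo, n):
    """Chi square for two cells with f_t = n/2, as n * (2p - 1)^2 with p = larger f_o / n."""
    p = larger_fo / n
    return n * (2.0 * p - 1.0) ** 2

def p_value_one_df(chi_square):
    """Upper-tail probability of chi square with one degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2.0))

chi_sq = nomograph_chi_square(8, 10)  # the worked example: f_o = 8 and 2, n = 10
print(round(chi_sq, 1), round(p_value_one_df(chi_sq), 2))
```

The reduction to two quantities, $p_o$ and N, is what allows the relation to be laid out on a straight-edge chart with one scale for each.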
AN INDEX OF ITEM VALIDITY PROVIDING A CORRECTION 
FOR CHANCE SUCCESS 


A. P. JOHNSON 
PURDUE UNIVERSITY 


The KG Index described below is proposed for evaluation as one 
approach to the problem of providing an index giving comparable 
values for items (1) of equal discriminative power at all levels of 
difficulty and (2) of different numbers of alternative responses. 


1. The KG Index 
In 1934 Votaw* suggested that item validity comparisons for 
upper and lower 27% groups of a tested population sample be based 
on the proportion in each group who know the answer to an item 
rather than on the proportion who answer it correctly. 
He gave the general equation:

$$x = \frac{nR - N}{n - 1},$$

in which
x† = the number in each group who know the correct answer,
n = the number of choices in the item,
R = the number of correct responses to the item within a given group,
N = the number of cases in each group.


Votaw found in some instances that values of x were nega- 
tive. He construed these to mean that either the items in ques- 
tion were (1) so stated as to “trick” examinees into making incor- 
rect responses, (2) keyed incorrectly, or (3) suffering from some 
other serious fault rendering them invalid. 

Without presenting the generalized ratio, he shows that for any 
group all of whom respond to any item and who are completely in 


* Votaw, D. F. Notes on validation of test items by comparison of widely
spaced groups. J. educ. Psychol., 1934, 25, 185-191.

† The writer uses hereafter the symbol K for the number of the total test
population who are estimated to know the correct answer.




ignorance of the correct answer N/n of them would be expected to 
mark the correct answer by chance. An experimental study by Rug- 
gles of the marking by a student group of the correct choice in a 
true-false test the material of which was wholly unfamiliar to them 
is mentioned by Lee and Symonds.* The percentage of correct re- 
sponses actually obtained was 51% when the chance expectation was 
50%. Votaw shows that the probability of any invalidating condi- 
tions existing in an item can be determined for any given negative 
value of x for that item by considering N/n plus or minus its probable
error. He reported the P.E. of N/n to be equal to $.6745\sqrt{Npq}$,
where $p = 1/n$ and $q = 1 - 1/n$.
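As a small illustration of the probable error just quoted, the following Python sketch evaluates it for a hypothetical case. The group size of 100 and the four-choice item are assumptions made for the example, not values from Votaw's study, and the function name is illustrative.

```python
import math

def probable_error_of_chance_rights(n_cases, n_choices):
    """P.E. of N/n = .6745 * sqrt(N p q), with p = 1/n and q = 1 - 1/n."""
    p = 1.0 / n_choices
    q = 1.0 - p
    return 0.6745 * math.sqrt(n_cases * p * q)

# Hypothetical group of 100 cases, all guessing on a four-choice item.
expected_rights = 100 / 4
pe = probable_error_of_chance_rights(100, 4)
print(expected_rights, round(pe, 2))
```

A negative value of x that corresponds to a number of rights many probable errors below N/n would then be hard to ascribe to chance alone.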


In 1936, Guilford† proposed as a means of evaluating the level
of difficulty of a test item the proportion passing corrected by an
allowance for chance success. This corrected proportion,
$c_p = (nR - N)/N(n - 1)$, is simply the proportion of the total test
population who may be expected to know the correct answer. In short,
based on the total R for the item it is K (or Votaw’s x for the entire
test population) divided by N.

Thus

$$K = \frac{nR - N}{n - 1}. \tag{1}$$





It is proposed that there be considered a new index, the KG In- 
dex, based not on contrasted upper and lower groups of 25%, 27%, 
or 33%, but on contrasted upper and lower groups equal in size for 
each item to the number of individuals estimated to know the correct 
response to that item. Suppose for a given item that 


R = 200,
W = 300,
N = 500,
n = 5.


Then

$$K = \frac{nR - N}{n - 1} = \frac{5 \times 200 - 500}{4} = \frac{500}{4} = 125.$$
It is postulated that if those 125 persons who are estimated to know 
the correct answer to the given item are the 125 persons who are 
highest on the criterion scale, that item may be said to have a perfect 








* Lee, J. M. and Symonds, P. M. New type or objective tests: A summary
of recent investigations (October 1931-1933). J. educ. Psychol., 1934, 25, 161-

† Guilford, J. P. The determination of item difficulty when chance success is
a factor. Psychometrika, 1936, 1, 259-264.
positive relationship to the criterion. If those 125 who were esti- 
mated to know the correct answer are the 125 persons who are lowest 
on the criterion scale, that item may be said to have a perfect nega- 
tive relationship to the criterion. The extent of this relationship 
could be determined by arranging the 500 papers in decreasing or- 
der of criterion scores and determining for that given item how many 
correct responses occurred among the top 125 papers. The more 
nearly that item approached perfect positive relationship with the 
criterion the more nearly the number of correct responses among 
the top 125 papers would, in this instance, approach 125. If this 
item had a perfect negative relationship to the criterion scores, all 
125 who were estimated to know the correct answer would appear 
among the lowest 125 on the criterion scale. The remaining number 
of correct answers, R — K , 200 — 125 or 75, would be distributed 
among the 500 — 125 or 375 papers remaining. Chance expectations 
with 75 rights among 375 papers would be 1 in 5 throughout the 
upper 375 papers on the criterion scale. Thus the upper 125 papers 
would be expected to include about 125/5 or 25 correct responses by 
chance. 
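The arithmetic of the example can be retraced in a short Python sketch. It assumes only equation (1) and the chance argument of the preceding paragraph; the function names are illustrative.

```python
def estimated_knowers(right, n_cases, n_choices):
    """K = (nR - N) / (n - 1): the number estimated to know the correct answer."""
    return (n_choices * right - n_cases) / (n_choices - 1)

def chance_rights_in_upper_group(right, n_cases, n_choices):
    """Rights expected by chance in the top K papers under perfect negative relationship."""
    k = estimated_knowers(right, n_cases, n_choices)
    lucky_guesses = right - k        # R - K rights not attributed to knowledge
    remaining_papers = n_cases - k   # papers above the lowest K on the criterion
    return k * lucky_guesses / remaining_papers

k = estimated_knowers(200, 500, 5)
chance_upper = chance_rights_in_upper_group(200, 500, 5)
print(k, chance_upper)
```

For the figures of the example the rate of lucky guesses among the remaining papers is 75/375, which is 1/n for a five-choice item.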

The KG Index can be developed as follows as a convenient sum- 
mary of how closely the actual responses approximate the perfect 
(or deviate from the chance) relationship with the criterion scores, 
as that relationship is defined above. 

Let us use the symbols, Ry for the number of right responses in 
the upper group and R, for the number of right responses in the 
lower group. In the perfect positive relationship postulated above 
the ratio R,/K should equal 1.0, and the ratio R,/K should equal 
1/n (since R, should equal K/n).* 

In a chance relationship both R,/K and R,/K should equal 1/7 .+ 

In a perfect negative relationship as postulated above, the ratio 
R./K should equal 1/nt (since R; should equal K/n), and the ratio 
R,/K should equal 1.0. 

* As Votaw indicates, the value N/n, or in this instance K/n, may well vary,
as estimated by the formula S.E. of K/n = $\sqrt{K \times 1/n \times (n-1)/n}$. According
to the probability tables for the normal curve of error, in 9973 cases out of 10000
K/n should not vary beyond $\pm 3\sqrt{K \times 1/n \times (n-1)/n}$. It is thus possible that
values of less than 1/n may occur.

† Idem.
‡ Idem.

It is possible to obtain a positive index for so-called positive
relationship, a zero index for chance relationship, and a negative index
for the so-called negative relationship by subtracting the ratios $R_U/K$
and $R_L/K$. This difference of proportions, $(R_U - R_L)/K$, is essentially
the same as the U-L Index proposed by the writer. The two are identical
when the value of K is .27N. The difference between $(R_U - R_L)/K$
and the U-L Index is that K is not fixed at .27N but may vary from
0 to N. Whenever K exceeds N/2, however, the upper and lower K
groups overlap by the amount 2(K - N/2) or 2K - N. For the sake
of simplicity, the practical question of how to handle omitted
responses is deferred until later. If the group considered to be
guessing the correct answer (i.e., those not knowing it) is designated by
G, then N - K = G. If, when K is greater than N/2, the G group
rather than the K group is made the basis for upper vs. lower group
comparison, the resulting difference between actual correct responses
in upper and lower groups is the same and the groups do not overlap.
The divisor (G) then is no greater than the effective maximum
value of the groups whose numbers of correct responses are
subtracted. The symbol B can be substituted for K in the ratio
$(R_U - R_L)/K$, where B (i.e., the base group) is K when K does not
exceed N/2, and G when K exceeds N/2.

Since for perfect positive relationship as defined above the
expected value of $(R_U - R_L)/B$ is 1 - 1/n or (n - 1)/n, and since for
perfect negative relationship the expected value of $(R_U - R_L)/B$ is
1/n - 1, the ratio in this form has a maximum theoretical value
dependent on the number of choices. This dependence can be
eliminated by multiplying the ratio $(R_U - R_L)/B$ by n/(n - 1). This
expression is the KG Index:


$$\text{KG Index} = \frac{n(R_U - R_L)}{(n-1)B}. \tag{2}$$
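
As a rough illustration, Formula (2) and the rule for the base group B described above can be sketched in a few lines. The function names are assumptions made for this sketch; the three calls reproduce the perfect positive, chance, and perfect negative cases for a five-choice item with B = 125.

```python
def base_group(K, N):
    """Return B: K itself when K does not exceed N/2, otherwise G = N - K."""
    return K if K <= N / 2 else N - K

def kg_index(r_upper, r_lower, B, n):
    """Formula (2): n (R_U - R_L) / ((n - 1) B)."""
    return n * (r_upper - r_lower) / ((n - 1) * B)

B = base_group(K=125, N=500)        # 125, since K does not exceed N/2
# Perfect positive relationship: all 125 upper papers right,
# 125/5 = 25 lower papers right by chance.
print(kg_index(125, 25, B, 5))      # 1.0
# Chance relationship: 25 right in each group.
print(kg_index(25, 25, B, 5))       # 0.0
# Perfect negative relationship.
print(kg_index(25, 125, B, 5))      # -1.0
```

Multiplying by n/(n - 1) is what brings the extreme cases to exactly +1 and -1 whatever the number of choices.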


2. Computation of the KG Index 

As a first step, the test papers are arranged from highest to 
lowest on the criterion scale to be used as the standard of validity. 
The number of persons marking the correct response and the number 
marking all incorrect responses in the upper 30% of papers and in 
the lower 30% of papers is determined by graphic item count or other 
means. In order to provide most efficiently for data needed later, it 
is suggested that the papers be divided into successive groups as fol- 
lows: upper and lower 6%, next upper and lower 4%, next upper and 
lower 5%, next upper and lower 5% and the remaining upper and 
lower 10% to total 30% in each.* With the graphic item counter, for 
instance, each upper group is run through separately and a separate 
count obtained on each. The same procedure is followed with the 
lower groups. For example, by adding the data of the 6% and the
next 4% groups, the information necessary for computing the KG
Index based on the upper and lower 10% groups for certain specific
items can be obtained. Similarly the data for any desired base groups
from 6% to 30% can be found for each item.

* The use of these suggested groups is believed to provide sufficient accuracy
while avoiding the necessity of computing different base groups for each item.
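
The pooling of successive sub-groups described above amounts to a running sum. The following is a minimal sketch; the counts are hypothetical and the names are assumptions of this illustration.

```python
from itertools import accumulate

# Successive sub-groups, as percentages of N, counted separately.
STEPS = [6, 4, 5, 5, 10]

def cumulative_counts(rights_per_step):
    """Turn per-sub-group counts of right answers into counts for the
    cumulative base groups of 6%, 10%, 15%, 20% and 30%."""
    sizes = list(accumulate(STEPS))
    totals = list(accumulate(rights_per_step))
    return dict(zip(sizes, totals))

# Hypothetical counts of right answers for one item in the upper sub-groups.
upper = cumulative_counts([22, 14, 17, 16, 30])
print(upper)  # {6: 22, 10: 36, 15: 53, 20: 69, 30: 99}
```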

The average number of correct and the average number of incorrect
responses in the upper and lower 30% groups provide a practicable,
if not a precise, means of determining the difficulty level of
each item by the formula:

$$K = R - \frac{W}{n-1}. \tag{3}$$


Divided by .3 N and when there are no omissions, this value becomes 
essentially equivalent to Guilford’s ¢, . 
Table 1 gives a suggested range of values of K in terms of N 
for which upper and lower 30% groups provide convenient base 
groups and for which base groups of smaller size are suggested. 


Table 1 


Recommended Base Groups and Comparable Ranges of K and R* 


Base     Range of K       Range of R in percentages of N for specified numbers of choices
group    in terms of N    n = 5 choices   n = 4 choices   n = 3 choices   n = 2 choices

.30N     .25 – .75N       40.0 – 80.0     43.8 – 81.2     50.0 – 83.3     62.5 – 87.5

.20N     .18 – .24N       34.4 – 39.9     38.5 – 43.7     45.4 – 49.9     59.0 – 62.4
         .76 – .82N       80.1 – 85.6     81.3 – 86.5     83.4 – 87.9     87.6 – 91.0

.15N     .13 – .17N       30.4 – 34.3     34.8 – 38.4     42.0 – 45.3     56.5 – 58.9
         .83 – .87N       85.7 – 89.6     86.6 – 90.2     88.0 – 91.3     91.1 – 93.5

.10N     .08 – .12N       26.4 – 30.3     31.0 – 34.7     38.6 – 41.9     54.0 – 56.4
         .88 – .92N       89.7 – 93.6     90.3 – 94.0     91.4 – 94.7     93.6 – 96.0

.06N     .04 – .07N       23.2 – 26.3     28.0 – 30.9     36.0 – 38.5     52.0 – 53.9
         .93 – .96N       93.7 – 96.8     94.1 – 97.0     94.8 – 97.3     96.1 – 98.0

0        0 – .03N         20.0 – 23.1     25.0 – 27.9     33.3 – 35.9     50.0 – 51.9
         .97 – 1.00N      96.9 – 100      97.1 – 100      97.4 – 100      98.1 – 100


(Underlined values represent expected percentages correct when all answers 
are marked according to chance.) 


* The suggested base group values of this table have been arrived at on the 
basis of both theoretical and practical considerations; except for a very few spe- 
cial test construction situations it is expected that they will prove quite satisfac- 
tory. 

The values of R have been derived from the basic formula
K = (nR - N)/(n - 1) solved for R, namely, R = (n - 1)K/n + N/n.
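
The conversion from K to R used for the table can be checked with a small sketch. The function name is an assumption of this illustration, and only the .25 to .75N row of Table 1 is reproduced here.

```python
def r_percent(k_fraction, n):
    """R as a percentage of N for K given as a fraction of N:
    R = (n - 1)K/n + N/n."""
    return 100 * ((n - 1) * k_fraction / n + 1 / n)

for n in (5, 4, 3, 2):
    low, high = r_percent(0.25, n), r_percent(0.75, n)
    print(f"n = {n}: {low:.2f} to {high:.2f}")
# n = 5: 40.00 to 80.00
# n = 4: 43.75 to 81.25
# n = 3: 50.00 to 83.33
# n = 2: 62.50 to 87.50
```

With K = 0 the same function returns the chance percentages 20.0, 25.0, 33.3, and 50.0 for five, four, three, and two choices.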
In most instances it is believed that the majority of items will
be of such difficulty that the upper and lower 30% groups will serve. 
If the suggested breakdown of groups has been followed, the data 
necessary for computing the KG Index will be readily available for 
all items. 

The following data for one item of a five-choice test will illus- 
trate the method for computing the KG Index when the number of 
omissions is negligible: 








R = 268
W = 96
O = 4 (negligible)
N = 368
n = 5

$$K = R - \frac{W}{n-1} = 268 - \frac{96}{5-1} = 268 - 24 = 244.$$


In terms of N , K = 244/368 = .663N . 


Entering the second column of Table 1, “Range of K in terms of 
N ,” we note that K for this item falls within the range .25 — .75N 
corresponding to a base group of .30 N (see the first column). 

Thus the desired base group equals .30 X 368 or 110.4. This 
figure is rounded off to 110. From the total count for the upper 30% 
of the test papers and from the total count for the bottom 30% of 
the test papers, the number of correct responses in these groups is 
found. The actual counts were: 


$R_U$ = 107          base group, B = 110
$R_L$ = 47

$$\text{KG Index} = \frac{n(R_U - R_L)}{(n-1)B}.$$

Substituting, we have

$$\text{KG Index} = \frac{5(107 - 47)}{4 \times 110} = .68.$$
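
The whole computation for this five-choice item can be retraced in a few lines. This is a sketch only: the function names are assumptions, and the choice of the .30N base group is taken from Table 1 rather than computed.

```python
def knowers(R, W, n):
    """Formula (3): K = R - W/(n - 1)."""
    return R - W / (n - 1)

def kg_index(r_upper, r_lower, B, n):
    """Formula (2): n (R_U - R_L) / ((n - 1) B)."""
    return n * (r_upper - r_lower) / ((n - 1) * B)

R, W, N, n = 268, 96, 368, 5
K = knowers(R, W, n)               # 244.0
k_fraction = K / N                 # about .663, inside .25 to .75, so B = .30N
B = round(0.30 * N)                # 110.4 rounded off to 110
index = kg_index(107, 47, B, n)
print(K, round(k_fraction, 3), B, round(index, 2))  # 244.0 0.663 110 0.68
```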


For this item the tetrachoric r based on a 66% vs. 34% split was
.67; the tetrachoric r based on a 50% vs. 50% split was .60; and
Guilford’s φ* based on contrasted top and bottom 25% groups was .66.

* Guilford, J. P. The phi coefficient and chi square as indices of item validity.
Psychometrika, 1941, 6, 11-19.

The agreement of these values is not always so close, as is
illustrated by the data for a two-choice item where chance successes tend
to attenuate the φ coefficient (and similarly other indices of
relationship not making a correction for chance successes). For a specific
two-choice item, N = 401, R = 239, W = 158, O = 4, and since K =
.202N, B = .20N. $R_U$ and $R_L$ were 62 and 36, respectively; thus
$n(R_U - R_L)/(n-1)B = .65$. The corresponding φ was .26; no values
of the tetrachoric r were computed.

Formulas (1) and (2) are not applicable where the number of 
omissions is appreciable, for they assume no omissions. 

Formula (3), K = R - W/(n - 1), can be used regardless of
the number of omissions. It is useful to determine K when it is
necessary to item-analyze speeded tests in which the number of
omissions among later items is appreciable. When K is expressed in terms
of N, N should include only those who read the item, not the
“non-reads” as defined below.

Omissions on speeded tests are usually of two types: (a) those
by persons who read the item but fail to mark any choice, and (b)
those by persons who have not read as far in the test as the end of
that item. Frederick B. Davis, in a communication to the writer,
suggests a method of determining the “non-reads” directly. The test
papers are scored and arranged in rank order according to the
criterion score. The top 6%, next 4%, next 5%, next 5%, and remaining
10%, to total the top 30% of the test papers, are segregated, as are
similar bottom groups. Next the papers of each upper group are
gone through separately to determine which item on each paper is
the last one marked. A tally is then made opposite the number of
the next item, for that one is presumably the first item not read by
the subject. This tally is the basis for a cumulative frequency table
of the “non-reads” for each particular item among the top 6%, top
10% (6% + 4%), top 15%, top 20%, and top 30% of the test papers.
A similar count is made of the “non-reads” among the lower 6%,
10%, 15%, 20%, and 30% groups.
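
One way to read the tallying procedure above is sketched below for a single sub-group. The data and function names are hypothetical, and the cumulation over items reflects an interpretation: a paper whose first unread item is item j is counted as a “non-read” for item j and for every later item.

```python
from collections import Counter
from itertools import accumulate

def first_unread_items(last_marked_by_paper):
    """Tally, for each paper, the item after the last one marked:
    presumably the first item the examinee did not read."""
    return Counter(last + 1 for last in last_marked_by_paper)

def non_reads_per_item(tally, n_items):
    """Cumulate the tally over items 1..n_items."""
    per_item = [tally.get(item, 0) for item in range(1, n_items + 1)]
    return list(accumulate(per_item))

# Hypothetical top-6% group of five papers on a 10-item test;
# each number is the last item marked on one paper.
tally = first_unread_items([10, 8, 10, 7, 8])
print(non_reads_per_item(tally, 10))
# [0, 0, 0, 0, 0, 0, 0, 1, 3, 3]
```

Counts for the top 10%, 15%, 20%, and 30% groups would then be obtained by adding the corresponding lists for the successive sub-groups.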

By graphic item count on the International Test Scoring machine
or by other means, the number of right responses and the number of
wrong responses in each top group to each item is found. The omits
may be obtained, if desired, by subtracting from the number of
papers within the appropriate upper group the rights plus wrongs plus
“non-reads.” Similarly the omits among the several lower groups
may be obtained, if desired.

Where the “non-reads” represent an appreciable proportion of
the appropriate base groups, the following modification of the basic
formula for the KG Index is suggested:

$$\text{KG Index} = \frac{n}{n-1}\left(\frac{R_U}{B - \text{“non-reads”}_U} - \frac{R_L}{B - \text{“non-reads”}_L}\right), \tag{4}$$

in which n, $R_U$, $R_L$, and B have the same meanings as in Formula (2),

“non-reads”$_U$ = “non-reads” in upper base group, and
“non-reads”$_L$ = “non-reads” in lower base group.

The postulated standard of perfect positive and of perfect negative
relationship against which each item is evaluated by the KG Index
is based upon probability theory. In the strictest sense, it is
valid only when (a) the number of cases is large and (b) all
of the alternative responses are equally enticing to those examinees
in ignorance of the subject matter of the question. Although these
conditions are not too frequently met in practice, the KG Index is
believed to possess some usefulness as an item validity index
giving closely comparable values for items (1) of different levels of
difficulty but equal discriminative power and/or (2) of different
numbers of alternative responses.
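
Formula (4) above can be sketched as follows. The function name is an assumption of this illustration, and the counts in the second call are hypothetical; the first call uses the five-choice example with no “non-reads,” in which case the result agrees with Formula (2).

```python
def kg_index_speeded(r_upper, r_lower, B, n, non_reads_upper, non_reads_lower):
    """Formula (4): each group's rights are divided by the number of papers
    in the base group that actually reached the item."""
    upper = r_upper / (B - non_reads_upper)
    lower = r_lower / (B - non_reads_lower)
    return n / (n - 1) * (upper - lower)

# With no "non-reads" the result equals Formula (2) for the five-choice example.
print(round(kg_index_speeded(107, 47, 110, 5, 0, 0), 2))    # 0.68
# Hypothetical counts: 10 upper and 30 lower papers did not reach the item.
print(round(kg_index_speeded(100, 44, 110, 5, 10, 30), 2))  # 0.56
```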
BOOK REVIEWS


HARALD CRAMÉR. Mathematical Methods of Statistics. Princeton: Princeton
University Press, 1946. Pp. xvi + 575.


It is the purpose of the author to present a logical development of the method 
of mathematical statistics, which presupposes on the part of the reader a mathe- 
matical knowledge of only calculus, algebra, and analytic geometry. The first sec- 
tion of the book, containing 137 pages, is devoted to the presentation of the various 
topics in higher mathematics which are necessary for the proof of theorems in the 
later sections of the book. This section, beginning with point-set theory, contains 
a logical development of Lebesgue measure, the Lebesgue integral, theory of
additive set functions, and the Lebesgue-Stieltjes integral. The procedure
throughout is to make a detailed, rigorous development for the simplest case and to
indicate briefly the possible generalizations of the theorems to less restricted
conditions and to multidimensional space. At the end of the first part of the book, one
chapter each is devoted to characteristic functions and matrix theory. A final 
chapter of this part covers such topics as Stirling’s formula and Beta and Gamma 
functions. 

This first section of the book may be understandable to European students 
who have finished a course in calculus, but in terms of American education in 
mathematics, it requires more background than that. Especially it requires a 
familiarity with the rigorous development of the calculus and with the manipula- 
tion of the functions of complex variables. The first section is, nevertheless, ex- 
tremely valuable because the selection of pertinent material from function theory 
provides a background for statistics which would be very difficult for anyone who 
is not a professional mathematician to obtain. 

The second part of the book discusses random variables and probability func- 
tions, both univariate and multivariate. The two introductory chapters deal with 
the fundamental basis of probability theory. Cramér develops the concept of 
probability from the empirical knowledge of frequency ratios but does not actually 
base his mathematics upon frequency ratios. Instead probability is a mathematical 
model, a function of point sets, consistent within itself, but designed to have a 
reasonable correspondence with the properties of frequency ratios. The discussion 
of random variables includes the important binomial, Poisson, normal, chi-square, 
t, z, and incomplete Beta distributions as well as explanations of Gram-Charlier 
series and the Pearson system of distributions. General properties of multivariate 
distributions are also discussed. 

The third section, on statistical inference, includes two main types of mate- 
rial, general theory of statistical tests and inferences and the details of particu- 
lar sampling distributions. The section on the theory of estimation and the chap- 
ter on the theory of testing hypotheses are particularly valuable. The discussion 
of the topic, however, seems less complete than the other sections of the book. 
Cramér probably felt that the other material was mathematically more funda- 
mental. 

Among psychologists, the book will be very useful, almost indispensable, to those
who are trying to develop real mathematical sophistication in statistics. After
careful study of Cramér, the reader should be in a position to understand more
advanced texts and the periodical literature. For the person who wishes to become 
a sophisticated user of statistics, but not necessarily a mathematical statistician, 
Cramér is also valuable but less so. His discussion is neither complete enough nor 
non-technical enough to give such a reader an understanding of the full impli- 
cations of such concepts as “uniformly most powerful test,” or “unbiassed test.” 
The book is not, nor was it intended to be, a handbook of statistical tests. In that 
sense also it is not of maximum value for the practical statistician. There is, how- 
ever, such a scarcity of integrated discussions of the modern theory of statistical 
tests and inferences that even for the less mathematical reader the book is valu- 
able. 
Fels Research Institute for the Study of Human Development 
ALFRED L. BALDWIN 


DAVIS, FREDERICK B. Item-Analysis Data: Their Computation, Interpretation,
and Use in Test Construction. Harvard Education Papers Number 2.
Cambridge: Graduate School of Education, Harvard University, 1946. Pp.
v + 42.


This monograph does not undertake to provide a complete review and discussion 
of techniques of item-analysis. It is limited to (1) the exposition of one procedure 
which the author has developed for analyzing and expressing the difficulty and 
discriminating power of items and (2) critical remarks on certain specific prob- 
lems in the interpretation and use of item-analysis data. In addition, a four-page 
bibliography on item-analysis is provided. 

The distinctive feature of the procedures presented by the author is that they 
yield indices scaled in presumably equal units, with a range from 0 to 100. In the 
case of difficulty, defined as per cent of the group knowing the answer to an item, 
this involves assuming that ability is normally distributed in the group studied and 
then converting percentages into abscissa values of the normal probability curve. 
These are then multiplied by an appropriate constant to give the desired range of 
scores. In the case of discrimination indices, correlation coefficients are translated
into values of Fisher’s z, and these are multiplied by the constant which yields 
a value of 100 corresponding to 99 per cent success in the upper 27 per cent of 
cases and 1 per cent success in the lower 27 per cent. 

Both difficulty and discrimination indices are extracted from the per cent of 
successes, failures, and omissions in the top and bottom 27 per cent of the group 
on the criterion measure (usually total score). A chart has been prepared which 
provides the difficulty and discrimination indices for any pair of percentages of 
success (after correction for chance) in the two groups. Thus, the procedure and
the table are an elaboration of those developed by Flanagan earlier.

The procedure of obtaining item indices from per cent of success in the upper 
and lower 27 per cent has been found to be an efficient and practical procedure, 
especially where an IBM test-scoring machine with a graphic item counter attach- 
ment is available. The use of scaled values for difficulty and discrimination indices 
gets away from the non-linearity of the number scale both of proportions and of 
correlation coefficients. However, the numerical values of the proposed new indices
will be entirely unfamiliar to the user and will present some difficulty of interpre- 
tation on that account. Their value would appear to be chiefly for individuals who 
are going to do a great deal of item analysis and in cases where the item indices 
are to be used for comparative purposes within a test-development organization.
The indices would probably prove a source of confusion in published reports. 

Chapter IV of the monograph presents a stimulating discussion of various 
problems connected with the use and interpretation of item-analysis data. In 
general, the tenor of these remarks is to emphasize that item-analysis is a valuable 
supplementary aid to but not a substitute for good item-writing and editing, and 
that item-analysis data should be used with insight and discretion rather than 
mechanically. Though the discussions are brief and suggestive rather than defini- 
tive in many cases, they should stimulate the reader to critical thought on a num- 
ber of phases of the item-analysis problem. 


Teachers College, Columbia University
ROBERT L. THORNDIKE

















SPECIAL NOTICE 


The U. S. Civil Service Commission has announced an exami- 
nation for filling Research Psychologist positions in Washington, 
D. C., and throughout the United States. 

The salaries for Research Psychologist positions range from 
$4,902 to $9,975 a year. The duties of the positions are of a highly 
responsible and technical nature. To qualify, applicants must have had
4 years of progressive professional experience in conducting or
participating in important research projects in the field of psychology.
This experience must have been at successively higher levels of
responsibility for the higher grades, and must show the applicant’s ability to
plan, direct and coordinate research programs of considerable scope
and complexity. Applicants for the highest salary level must have
earned recognition as leaders in the field of psychology. Graduate 
study in psychology may be substituted year for year for 3 years of 
the required experience. No written test is required for this exami- 
nation. The age limit of 62 years is waived for persons entitled to 
veteran preference. 

Applications for the Research Psychologist examination will be 
accepted until further notice. However, some positions will be filled 
immediately. Persons interested in these positions should apply at 
once. Information and application forms may be obtained at most 
first- and second-class post offices, from Civil Service regional offices, 
and from the U. S. Civil Service Commission, Washington, D. C. 